Abstract

Very-low bandwidth video-conferencing, which is the simultaneous transmission of speech and pictures 
(face-to-face communication) of the communicating parties, is a challenging application requiring an
integrated effort of computer vision and computer graphics. This paper consists of two major parts. First, we
present the outline of a simple approach to video-conferencing relying on an example-based hierarchical 
image compression scheme. In particular, we discuss the use of example images as a model, the number 
of required examples, faces as a class of semi-rigid objects, a hierarchical model based on decomposition 
into different time-scales, and the decomposition of face images into patches of interest. In the second 
part, we present several algorithms for image processing and animation as well as their experimental
evaluation. Among the original contributions of this paper is an automatic algorithm for pose estimation and
normalization. Experiments suggest interesting estimates of necessary spatial resolution and frequency 
bands. We also review and compare different algorithms for finding the nearest neighbors in a database 
for a new input as well as a generalized algorithm for blending patches of interest in order to synthesize 
new images. Extensions for image sequences are proposed together with possible extensions based on 
the interpolation techniques of Beymer, Shashua and Poggio (1993) between example images. Finally, we 
outline the possible integration of several algorithms to illustrate a simple model-based video-conference 
system. 
Figure 1: Simple scheme to illustrate a video-conference
system. Only one direction of data flow is indicated. For 
transmission in the reverse direction a mirror symmetric 
system is used. The dashed arrows between the video 
and the audio channel indicate a possible coupling. 


1 Introduction and motivation 

Video-conferencing requires the simultaneous
transmission of speech and pictures of the communicating parties
in real time and with high quality (cf. Figure 1). We
choose this apparently technical problem as a paradigm 
to investigate new concepts related to the representation 
of 3-D objects without explicit volumetric or structural 
information. In particular, we present a hierarchical
architecture for model-based image compression of
semi-rigid objects. Human faces are typical representatives of
this object class. 

This paper consists of two major parts. The first part
outlines our approach to video-conferencing (Section 3) 
and it is introduced with a brief overview of work on 
face images and especially face recognition. Section 4 
presents a specific architecture for a video-conference 
system based on the previous approach. Several
algorithms for image processing and animation are described
and experimental results are discussed. A novel robust
algorithm for automatic pose estimation and
normalization in the presence of strong facial expressions is
presented and experimental results are discussed in detail.
At the end of Section 4, possible extensions to the video- 
conferencing scheme by interpolation between example 
images are reviewed and discussed. 


2 Related topics 

Conventional model-based image compression methods
are briefly reviewed to emphasize their potential
problems and to point out their differences from the approach
presented here. We review work on images of faces and
then focus our discussion on previous work on face
recognition that is of interest for our example-based approach
to video-conferencing. Finally, we list specific differences 
between processing for face recognition and for video- 
conferencing. 


2.1 Conventional and model-based 
compression 

The objective here is not to give a comprehensive survey
of image compression techniques, but rather to sketch a 
few key ideas. 

Most existing image coding techniques are designed
to exploit statistical correlations inherent to images or
to time sequences. Conventional waveform coding
techniques are known to introduce unnatural coding errors
when reproducing images at very low bit rates. The
source of this problem lies in the use of statistical
correlations which are not related to image content and cannot
make use of context information. However, such general
techniques are very successful for lower compression
ratios. Popular examples are JPEG (Joint Photographic
Experts Group) for single images and MPEG for motion
sequences, which have also been implemented in
hardware. JPEG applies the discrete cosine transform (DCT)
to blocks of 8 x 8 pixels and yields good results for
compression ratios up to about 1 : 25. The algorithm is
symmetrical, i.e., computational costs for compression
and for decompression are about the same.
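
To make the block transform mentioned above concrete, the following minimal sketch computes an orthonormal 2-D DCT of a single 8 x 8 block and its inverse in plain Python. It only illustrates the transform step; quantization and entropy coding, which JPEG also applies, are omitted, and the function names (dct_matrix, dct2, idct2) are chosen here for illustration.

```python
import math

N = 8  # block size used by JPEG


def dct_matrix(n=N):
    """Orthonormal DCT-II basis; row k holds the k-th cosine basis vector."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * x + 1) * k / (2 * n))
                     for x in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct2(block):
    """Forward 2-D DCT: C * block * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """Inverse 2-D DCT: C^T * coeffs * C (C is orthonormal)."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)
```

For a block of constant grey value all energy ends up in the single coefficient at index (0, 0), which is why smooth image regions compress well. The forward and inverse transforms also have the same cost, in line with the symmetry noted above.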

It is widely accepted that only model-based
compression schemes have the potential of very high compression
rates while retaining high image fidelity. In model-based
coding schemes the sender (encoder) and the receiver
(decoder) contain a common specific model or special
knowledge about the objects that are to be coded. The
encoder analyzes the input images and estimates model
parameters. The decoder synthesizes and reconstructs
images from these parameters using the internal model
or knowledge. With this kind of coding, very low bit
rates can be realized since basically only the analyzed
model parameters are transmitted.

The general approach of such model-based coding
techniques for faces, which is still an active research
topic (cf. [1, 20, 44, 24, 25], to give some examples), is
to use a volumetric 3-D facial model (such as a polygonal
wire frame model). Full face images under standard
view can be projected onto the wire frame model for
reconstruction by using texture mapping techniques. Two
major difficulties are inherent to this approach. Firstly,
the generation of a realistic 3-D model for individual faces
is very difficult in itself. Almost all available automatic
techniques either yield poor results for faces or are not
applicable to video-conferencing because they require
controlled or artificial conditions (structured light,
illumination, laser scanner, marks attached to the face
surface, etc.). Secondly, difficulties arise from the necessity
to register the 3-D model and the images used for
texture mapping. Also, 3-D motion parameters have to be
computed precisely from image data.

2.2 Work on face images and recognition 

A good survey of the state of the art on image
processing of faces is given in [12]. However, most of the work
with faces in computer vision was done on processing for
recognition [33, 6, 61, 36, 39, 3, 2, 4, 15, 28, 9] and some
work on different kinds of classification tasks [21, 22, 14].
Almost all of this work treats face recognition as a static
problem approached by pattern recognition techniques
applied to single static images. Only recently has the
attention of researchers shifted to the temporal aspect of
facial expressions by using optical flow in sequences of
face images [41, 42, 25, 65].

We do not intend to give a comprehensive overview of 
face recognition here. Rather, we will summarize some 
ideas that are relevant to our work. Recently, a
systematic comparison of typical approaches (feature-based
versus template-based techniques) to face recognition was
carried out by Brunelli & Poggio [13, 15]. Several other
approaches to face recognition have also been presented 
(for example [6, 61, 36, 4]). 

The first approach is influenced by the work of Kanade
[33] and uses a vector of geometrical features for
recognition. First the eyes are located by computing the
normalized cross-correlation coefficient with a single eye
template at different resolution scales. The image is then
normalized in scale and translation. By independently
localizing both eyes, small amounts of head rotation can
be compensated by aligning the eye-to-eye axis to be
horizontal. The vertical and horizontal integral projection of
two directional edge maps (components of the intensity 
gradient) are used to extract features, such as position 
and size of nose and mouth, as well as eyebrow position 
and thickness. Assumptions about natural constraints 
of faces, such as bilateral symmetry and average facial 
layout, are used at this stage. A total of 35 geometrical 
features are extracted. Recognition is then performed 
with a Bayes classifier applied to the vector of geometric 
features. 

The second approach uses template matching by
correlation and can be regarded as an extension of the
pioneering work of Baron [6]. Images of frontal views of
faces are normalized as described above. Each person is
represented by four rectangular masks centered around
the eyes, nose, and mouth, respectively. The relative
position of these masks is the same for all persons. The
normalized cross-correlation coefficients of the four masks
are computed by matching the novel image against the
database. Recognition is done by finding the highest
cumulative score. Some preprocessing is applied to the
grey-level images to decrease the sensitivity of
correlation to illumination effects. An interesting finding is that
recognition performance is stable over a resolution range
of 1 : 4 (within a Gaussian pyramid). This indicates that
quite small templates can be used, thus making
correlation feasible at very low computational cost.
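
The normalized cross-correlation coefficient used by both approaches can be sketched in a few lines. The code below is a minimal illustration, not the implementation of [13, 15]: images are plain lists of rows, the search is exhaustive, and the helper names (ncc, best_match) are hypothetical.

```python
import math


def ncc(patch, template):
    """Normalized cross-correlation coefficient of two equally sized patches."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # a flat patch carries no structure to correlate
    return sum(x * y for x, y in zip(da, db)) / denom


def best_match(image, template):
    """Slide the template over the image; return (score, (row, col)) of the best fit."""
    th, tw = len(template), len(template[0])
    best = (-2.0, (0, 0))
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best[0]:
                best = (score, (r, c))
    return best
```

Because the means are subtracted and the result is divided by both norms, the score does not change when the brightness of a patch is scaled or shifted by a constant, which is one reason the measure is attractive for matching under varying illumination.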

Gilbert & Yang presented a real time face recognition 
system using custom VLSI hardware [28]. Their system 
is based on the template-matching recognition scheme 
outlined by Brunelli & Poggio [13, 15].

In most of the work with faces the images are
normalized so that the faces have the same position, size and
orientation after manually locating the eyes and mouth. 
Normalization is often achieved by alignment of the T 
spanned by both eyes and the mouth. 
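
One way to realize such an alignment, assuming the three landmarks (both eyes and the mouth) have already been located, is to solve for the affine map that moves them onto fixed canonical positions. The sketch below does this with Cramer's rule; the function names and the canonical target coordinates used in the example are assumptions made for illustration only.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def solve3(m, rhs):
    """Solve the 3 x 3 linear system m * x = rhs by Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("landmarks are (nearly) collinear")
    solution = []
    for col in range(3):
        mc = [list(row) for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        solution.append(det3(mc) / d)
    return solution


def affine_from_landmarks(src, dst):
    """Affine map sending the three src points onto the three dst points.

    Returns two rows (a, b, c) and (d, e, f) with
    x' = a*x + b*y + c and y' = d*x + e*y + f.
    """
    m = [[x, y, 1.0] for x, y in src]
    row_x = solve3(m, [p[0] for p in dst])
    row_y = solve3(m, [p[1] for p in dst])
    return row_x, row_y


def apply_affine(transform, point):
    (a, b, c), (d, e, f) = transform
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)


# Example: left eye, right eye, mouth as found in an image (made-up values),
# mapped onto a hypothetical canonical layout in a 128 x 128 frame.
found = [(30.0, 40.0), (70.0, 44.0), (52.0, 90.0)]
canonical = [(32.0, 32.0), (96.0, 32.0), (64.0, 96.0)]
T = affine_from_landmarks(found, canonical)
```

Three non-collinear point pairs determine the six affine parameters exactly, so the map compensates translation, scale, in-plane rotation and shear in one step; it does not model rotation of the head in depth.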

2.3 Face recognition vs. video-conferencing 

Since much work has been done on face recognition, we
want to make as much use as possible of the available
experience. On the other hand, there are several
significant differences between the problems of face recognition
and of video-conferencing. These issues have important
implications for our approach and are discussed
below; Table 1 summarizes the most important
differences.

In face recognition the task is in general to match 
a new face image against a gallery of stored images of 
different persons. The low-level processing should
extract significant features and a subsequent classification
should identify the appropriate person and should
determine if there is a good match in the database at
all. Recognition should be invariant against variations
of an individual face (emotional condition, not shaved
recently, etc.), but should be highly discriminative to
differences between individuals. To achieve this goal,
several images taken under different views and
illumination conditions are commonly stored in the database. In
many applications it is feasible to acquire the example
images under (or at least close to) standardized
conditions, such as frontal view and neutral facial expression.
Moreover, during the recognition phase, it is often
possible to repeatedly take snapshot images until one comes
close enough to the standard conditions (as mentioned
in [28] for instance). 

A different paradigm applies for applications where 
recognition is used for validation or verification only. For 
instance, in an access control system, there may be prior 
information about the person’s identity available, e.g., 
by means of a personalized key or code-card. Such a 
system has only to decide whether the match with a 
selected entry in the database is good enough in order 
to verify the identity of the actual person. An exhaustive
search for the absolute optimum of the cost function,
as is commonly used in recognition, is not feasible
here. Other means for normalizing and thresholding the
cost function have to be utilized. Of course, the rate of
false positive decisions should be very low.

In video-conferencing the challenge is to achieve 
high fidelity (high resolution, realistic colors and quasi 
real time) transmission between communicating parties, 
while keeping the required transmission bandwidth as 
small as possible. A reasonable assumption is that the 
identity of the person is known and does not change 
throughout a session (i.e., video-conference call). This
assumption allows us to exploit knowledge and examples
accumulated during previous sessions of the same
person. As opposed to face recognition systems, a video-
conference system must be very sensitive in explicitly
detecting even minor individual variations. The detected
variations (either with respect to the previous images, or
with respect to similar example images) have to be
parameterized and coded to facilitate efficient transmission
without sacrificing detailed reconstruction.

For a video-conference system, we cannot significantly
restrict the range of admitted head poses or limit the
variety of facial expressions. The only applicable
assumptions arise from the physical and anatomical limitations
of the human face, head and body.

Another difference is rooted in the more passive
character of our task. We cannot repeatedly acquire new
images until we have a suitable one, but we must process
the incoming sequence of images. Moreover, we cannot
direct a person to behave in a certain way (e.g., to obtain
approximately standard conditions), as is possible for
recognition, at least while building the database. On the
other hand, the high sampling rate within the incoming
image sequence allows one to exploit smoothness in time
(due to the inertia of physical objects). Thus, the
differences between successive frames are small and finding
correspondences is easier than for recognition. The
algorithms need not start from scratch for each new image,
but can rely on predictions derived from previous images
to increase stability and coding efficiency. We have good
reasons to expect that real time operation (video frame
rate) can be achieved in the very near future with an
affordable platform.

Table 1: The most important differences between face recognition and video-conferencing (transmission for
reconstruction of images of 3-D objects) are summarized here.

face recognition                                                  | video-conferencing
------------------------------------------------------------------+------------------------------------------------------------------
task is comparison and matching against (large) example database  | task is “best” possible reconstruction with small channel capacity
examples in database are from different faces                     | all examples are of the same face
discrimination of inter-individual features is important          | additional information for identification is available
should be invariant to individual variations                      | individual variations are most important
roughly standard pose of all faces                                | no a priori standard pose possible
roughly standard facial expressions possible                      | all facial expressions must be admitted
deals with static images                                          | quasi-continuous sequence of images
image acquisition can be repeated                                 | no repetition, real time required


3 Outline of the approach 

In this section we outline our approach to video- 
conferencing. Due to limited space the scope is
restricted to five topics that have the strongest impact upon
the system architecture presented in Section 4.

3.1 Examples as model 

Several conventional approaches for model-based coding 
of face images are reported in the literature. Most of 
them rely on an explicit metric 3-D model (wire frame 
model) of the specific face [1, 44]. These volumetric
models are obtained from image data under different
viewpoints or from laser range-scanners. Shape-from-motion
algorithms are known to be not very stable and quite
noise sensitive. Recently, several structure-from-motion
algorithms have been demonstrated to yield good results
from real image sequences. However, they crucially
depend on stable features that can be accurately localized
and tracked in the images. In face images, features of
this kind are not present in sufficient number or quality.
In some work auxiliary marks (like white points in [1])
were attached to the person’s skin. While such aids may
be useful for research purposes they are certainly not
acceptable for a commercial video-conference system. On
the other hand, laser range-scanners are very expensive,
comparatively slow, and difficult to handle (due to
subtle scanning mechanics). The problems are even more
severe for systems that have aligned video cameras to
simultaneously record images for texture mapping.

In contrast to these more conventional concepts for 
video-conferencing we want to avoid the detour of
explicit 3-D models. In this paper we advocate a model-
based coding scheme to reconstruct images of 3-D
objects (e.g., faces) directly from images. The model is
based on a set of 2-D example images of a person’s face.
We assume that the set of example images is acquired at
the beginning of a session; possibly the system may fall
back upon an example database from previous sessions.
The set of example images is initially transmitted to the
receiver (by means of conventional image compression
techniques). During the subsequent continuous
transmission the examples are already stored on the sender
(encoder) and on the receiver (decoder) side. Therefore, 
approaches of this kind are also called “memory-based”. 

The stored example images span a high dimensional 
space of different poses, viewpoints, facial expressions, 
and also illumination conditions. As suggested for
instance by Poggio & Brunelli [46], object images can be
generated by interpolating between a small set of
example images. They described this interpolation in terms
of learning from examples [45, 47, 48]. 

To accomplish the interpolation for image
reconstruction, two different approaches are conceivable. The first
is related to a new approach to computer graphics [46].
This method has been proposed to synthesize new
images from examples attributed with specific and given
parameters. For instance, the image sequence of a
walking person can be interpolated over time from a few
images showing distinct postures; here the parameter is
simply time (cf. [46]). In computer graphics we can
interactively choose the right examples and tailor the
parameterization to synthesize images that are close enough
to what we want by interpolation in a relatively
low-dimensional parameter space. Some applications for
special animation effects in movies can also be found in [64].

The second approach has the same memory-based
flavor and is in fact provably almost equivalent. The first
step is to find the nearest neighbors, i.e., the most similar
examples according to some appropriate distance
measure, to the novel view within the database. The
following step may then estimate the weight for each
neighbor to yield the best interpolation (that is, the
closest weighted combination to the novel image) between
these examples. There may also be higher dimensional
cases where better interpolation can be achieved if
examples other than the nearest neighbors are used [10]. For
the purpose of video-conferencing this second approach
seems more natural. It is in fact not immediately
obvious how details of facial expressions should be
parameterized in a video-conference system. Moreover, the first
method requires explicit estimation of these parameters
in addition to pose. The second approach avoids the
bottleneck of predetermined parameterization, but
selects the basis for interpolation in a more adaptive way.
Thus, it is potentially more flexible at the expense of a
data-dependent higher dimensional parameter space.
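
The two steps of the second approach can be sketched as follows. This is a deliberately reduced illustration under several assumptions that are not taken from the paper: images are flattened to lists of grey values, the distance measure is the squared Euclidean distance, only the two nearest examples are blended, and their weights are constrained to sum to one so that a single weight has a closed-form least-squares solution.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two flattened images."""
    return sum((p - q) ** 2 for p, q in zip(u, v))


def nearest_neighbors(query, examples, k=2):
    """Indices of the k examples closest to the query, nearest first."""
    order = sorted(range(len(examples)),
                   key=lambda i: sq_dist(query, examples[i]))
    return order[:k]


def blend_weight(query, a, b):
    """Weight w minimizing |query - (w*a + (1 - w)*b)|^2."""
    diff = [p - q for p, q in zip(a, b)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        return 0.5  # identical examples: any weight gives the same image
    return sum((x - q) * d for x, q, d in zip(query, b, diff)) / denom


def reconstruct(query, examples):
    """Return the compact code (i, j, w) and the image it decodes to."""
    i, j = nearest_neighbors(query, examples, 2)
    w = blend_weight(query, examples[i], examples[j])
    image = [w * p + (1.0 - w) * q for p, q in zip(examples[i], examples[j])]
    return (i, j, w), image
```

In this toy setting the code for a frame consists of two example indices and one weight, which hints at why a memory-based scheme can operate at very low bit rates once the examples are stored on both sides. Plain pixel-wise blending is used here only for simplicity.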

To evaluate the feasibility of this concept in the
preliminary implementation of this paper we will consider
only the nearest neighbor in the database. Beymer,
Shashua & Poggio [10] have provided a preliminary
evaluation of the two approaches for video-conferencing.

3.2 Number of examples 

The major objection to the example-based approach for 
a video-conference system might be an excessive
requirement of memory to store the example database common
to the sender (encoder) and the receiver (decoder) side. 

Due to today’s semiconductor technology, however, 
fast memory is affordable in abundance. For instance, 
standard RAM chips with 16 MB can already
accommodate 256 images of full size (256 x 256 pixels) without any
further compression. Therefore, storage capacity does 
not cause an insuperable problem. 
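
The storage figure can be checked with a short calculation. The sketch assumes one byte per pixel (8-bit grey-level images) and 1 MB = 2^20 bytes; neither assumption is stated explicitly in the text, but both are needed for the numbers to agree.

```python
MB = 1024 * 1024  # assumed: 1 MB = 2**20 bytes


def image_bytes(width, height, bytes_per_pixel=1):
    """Uncompressed size of one image (default: 8-bit grey levels)."""
    return width * height * bytes_per_pixel


def images_that_fit(memory_bytes, width, height, bytes_per_pixel=1):
    """Number of whole uncompressed images that fit into the given memory."""
    return memory_bytes // image_bytes(width, height, bytes_per_pixel)


capacity = images_that_fit(16 * MB, 256, 256)
```

Under these assumptions each 256 x 256 image takes 64 KB, so 16 MB hold exactly 256 images. With three bytes per pixel (color) the same memory would hold 85 images.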

On the other hand, the costs for initial transmission 
of the examples should be kept as low as possible. Thus, 
an interesting question is to what extent the number of 
examples can be reduced without restricting the variety 
of expressions that can be synthesized. 

Loosely speaking, one can think of a high dimensional 
space that is defined or spanned by the example images. 
The dimension of this space is related to the number 
of distinct face images that can be generated. In other 
words, we want to reduce the number of examples used as 
nearest neighbors or for interpolation without reducing 
the dimensionality of this example space. 

There are several possibilities. The common thread is 
to divide the abstract example space into lower
dimensional subspaces. This, however, relies on the
assumption that various properties of face images are separable,
i.e., certain aspects of the images are to a large extent
independent of others. For reconstruction, a face image 
can then be composed of intermediate results obtained 
within the subspaces. 

Clearly, the concrete separation into subspaces must
be subjected to experimental justification. The next
sections discuss some ways to reduce the number of
examples, i.e., reduce the costs for initial transmission and
storage, while at the same time preserving the
dimensionality of the example space.

3.3 Faces as semi-rigid objects 

Human heads/faces are representatives of a special class
of objects; we call them semi-rigid objects. The name
accounts for the fact that these objects are not
thoroughly rigid, but at a larger scale still approximately
retain their shape. A prerequisite is that a decomposition
of the object dynamics into the motion of the object as
a whole (e.g., translation of center of mass and rotation
about an axis through this point) and the variation of
the object’s shape makes sense. Moreover, the dynamic
range of variations in the object shape is small compared
to the total object extension. In other words, the
variation in shape can be formulated as a perturbation to
a standard or average shape. Since the object shape is
subjected to variations, we cannot expect to find points
on the object surface that are fixed with respect to any
object-centered reference frame, e.g., as is defined by the
skull. Therefore, it will, in general, not be possible to
infer the object’s position based on observations of
localized feature points on the surface.

An experimental finding for face images is that facial 
expressions as well as detailed features and the overall 
shape of the head become apparent at different ranges 
of spatial frequency. This is demonstrated in Figure 4. 
Based on this observation, we conjecture that the pose 
of the semi-rigid head can be estimated from the low- 
frequency images alone, thus discarding the variations 
due to facial expressions. 
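
The low-frequency images referred to here can be obtained, for example, from a Gaussian pyramid. The following sketch builds such a pyramid with a separable 1-2-1 binomial filter followed by subsampling by two. The filter choice, the border handling (edge replication) and the function names are assumptions for illustration and not necessarily those used for Figure 4.

```python
def blur_1d(values):
    """Smooth a sequence with the binomial kernel (1/4, 1/2, 1/4)."""
    n = len(values)
    out = []
    for i in range(n):
        left = values[max(i - 1, 0)]        # replicate the border sample
        right = values[min(i + 1, n - 1)]
        out.append(0.25 * left + 0.5 * values[i] + 0.25 * right)
    return out


def blur(image):
    """Separable low-pass filter: rows first, then columns."""
    rows = [blur_1d(row) for row in image]
    cols = [blur_1d(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def reduce_level(image):
    """Low-pass filter, then keep every second row and column."""
    return [row[::2] for row in blur(image)[::2]]


def pyramid(image, levels):
    """List of images from full resolution down to the coarsest level."""
    out = [image]
    for _ in range(levels - 1):
        out.append(reduce_level(out[-1]))
    return out
```

Each level halves the resolution and suppresses the finest remaining detail, so the coarse levels retain mainly the overall shape of the head while fine structure such as a pixel-scale checkerboard is almost completely removed.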

In face recognition commonly labeled feature points
like eyes, corners of the mouth, etc. (see Figure 3 for
illustration) are used to compensate for the pose. This
requires detection and accurate localization of
corresponding points in a new image and the example images in the
database. Reliable detection and precise localization of
such predefined and labeled features in one image are
already difficult. Furthermore, inferring the pose from
correspondence of such feature points across distinct
images is prone to errors in the presence of facial
expressions. For instance, the pupils of the eyes may move by
more than 1 cm to both sides due to changes in
direction of gaze and vergence movements as is depicted in
Figure 3. Also, the pupils may temporarily disappear
when the eyelids are closed during winking or blinking.
The corners of the mouth are rather unreliable feature
points for pose estimation, since they are subjected to
substantial movements with respect to the head due to
facial expressions and during normal speech.

To circumvent the problems sketched above, we
propose an adaptive strategy for robust pose estimation
and compensation that is suitable for video-conferencing.
Details are described in Section 4.1. In a nutshell the
main ideas are: to make use only of the low-frequency
bands within a coarse-to-fine refinement technique to
estimate the pose; to use the constant brightness
constraint to find correspondence for all image points; to
rely on lower level criteria to select adequate
correspondence points based on a confidence measure; and, finally,
to use robust statistics to estimate model parameters
within the multiresolution refinement.

Figure 2: Description at different time-scales gives rise to
a hierarchical system architecture (see text for details).

3.4 Description on different time scales 

In what follows we consider the object model on different 
time-scales. Based on this decomposition, we derive a
hierarchical architecture for a video-conference system. At
least three different time-scales should be distinguished 
as is depicted in Figure 2. 

Notice that the general idea is by no means restricted 
to video-conferencing, but carries over to recognition. 
This claim is reflected in the three lines under the
arrow in Figure 2 leading from a general formulation (top)
to the specific example of faces and video-conferencing 
(bottom). 

It is a reasonable assumption that the object class a
system has to deal with remains the same over a “long”
period of time or does not change at all. Thus, on the
slowest time scale we can assume that we have prior
knowledge about the object class. This suggests the use
of a generic object model such as an average or
prototypical face. Some appropriate model assumptions at this
level are: a rough model of the 3-D shape of a face/head
(e.g., a quadric surface, see [53, 54] for details); human
heads exhibit approximately bilateral symmetry; the set
of constituents of a face (two eyes, nose, mouth, two
ears, etc.); stable features of constituents (the pupil is
round and darker than the white of the eyeball, teeth are
white, the nostrils are dark, the relative size
between different parts, etc.); the geometric arrangement
of constituents (nose is between mouth and eyes, and
vertically centered between both eyes, the ears are far
apart, etc.); the variability of constituents in color and
in geometry (eyes are fairly fixed in the head, location of
pupil changes with direction of gaze, height of an open
mouth does not exceed its width, etc.).

On the intermediate time-scale we are concerned with
a specific representative of the class. The object model
is refined by accounting for individual features that are
specific to a person’s face. Typical features that are fairly
stable are, for example: the 3-D shape of the head
insofar as determined by the skull (excluding the region
around the mouth, of course); the color of eyes, hair,
and teeth; to some extent also the complexion and the texture
of the skin; typical dynamics of facial expressions and
mimicry; defects (e.g., scars left from operations or
accidents) or irregularities (e.g., moles or stained spots) in
the appearance.

Finally, on the fastest time-scale we have to deal with
the variations of an individual object instance. Even
for many non-rigid objects it makes sense to decompose
the description of the object dynamics, for instance into
the motion (translation and rotation) of a local reference
frame (e.g., center of mass as origin)¹ and a description
of the non-rigid dynamics with respect to this
object-centered reference frame. For the video-conference
system we therefore want to separate the estimation of the
global pose from the dynamics of the more local facial
expressions.

3.5 Decomposition into patches of interest 

A further way to reduce the number of examples and the
storage requirements is to subdivide the normalized face
images into disjoint or overlapping subregions. Such
subregions clipped from the original image may have
arbitrary shapes to allow for maximal generality. They are
called patches of interest (POI) in the sequel, as
opposed to the rectangular regions of interest (ROI)
commonly used. This concept is suitable to animate fine
facial movements and facial expressions by blending the
subregions together to reconstruct a composite image at
the receiver side. The most important POIs are located
around the eyes and the mouth. Note that, because of
the arbitrary shape, different POIs for the eye pupil,
eyelid, and eyebrows could be used as well as different POIs
for the corners of the mouth and teeth. In Section 4.3
two algorithms to find the nearest neighbor of an image
region are described. A versatile algorithm for blending
possibly overlapping POIs of arbitrary shape together is
described in Section 4.4.1.

Interestingly, the number of subimages (pasted into
a base image of a face) needed to achieve realistic
animation is often surprisingly small. This fact has also
been pointed out in the literature (see [24] for example).
A realistic animation of an eye blink can be achieved
by using only four distinct subimages if the eyeball is
fixed. Duffy noted that only five different images are
required for satisfactory simulation of natural eye
movements. However, this should be considered as a lower
bound to achieve realistic animation of the position of
the eyeball (without interpolation between the
examples). Also, some examples for different pupil diameters
will be required. Moreover, in experiments on lip-reading
it has been demonstrated that the essence of a
conversation can be picked up from a sequence of images alone.
Remarkably, a set of only 19 visually
distinguishable images (showing particular arrangements of
lip, tongue, and teeth) is sufficient [30]. Although for
realistic video-conferencing a larger number of images may
be required, this result is very encouraging.

¹ More generally, we may admit not only rigid transformations;
the local reference frame may be non-orthogonal and
time-varying.

The advantage of having distinct example patches for
different regions of a face is manifold. First of all,
instead of requiring separate example images of the whole
face for all possible poses, lip-shapes, eye positions, etc.,
a face can be reconstructed from a smaller number of
example patches for subregions. For instance, suppose we
want to synthesize various face images with eye blinks and
speech. Assuming the above mentioned number of example
images for the subregions (let’s say 20 different
examples for the mouth and 5 distinct images for each
eye), a total of 500 different combinations can be obtained.
Thus, using only 31 examples (including a base image for
the whole face), a much larger number of face
images can be animated. Notice that the gain in the
number of possible combinations will increase dramatically
the more we can subdivide the high dimensional
space of possible faces into lower dimensional subspaces.
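
The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch: the region names and helper functions are hypothetical, and the patch counts are the ones assumed in the text (20 mouth examples, 5 per eye, one base image).

```python
from math import prod

def count_combinations(examples_per_region):
    """Number of distinct faces when one patch is chosen per region."""
    return prod(examples_per_region.values())

def count_stored_images(examples_per_region, base_images=1):
    """Images that must be stored: all patches plus the base image(s)."""
    return sum(examples_per_region.values()) + base_images

# Patch counts assumed in the text; the region names are illustrative.
regions = {"mouth": 20, "left_eye": 5, "right_eye": 5}

combinations = count_combinations(regions)   # 20 * 5 * 5
stored = count_stored_images(regions)        # 20 + 5 + 5 + 1
print(combinations, stored)
```

Adding a further independent region multiplies the number of combinations but only adds to the number of stored patches, which is the gain the paragraph refers to.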

In addition to significantly reducing the number of 
needed example images, the memory required to store 
the POIs is much less than for the whole face images, 
since the memory required for a POI is roughly pro¬ 
portional to the percentage of the area it covers in the 
image. For the images shown in this paper (for instance
the left image in Figure 5) this yields approximately 1%
for each eye POI and about 2-3% for the region around
the mouth.

Another benefit of utilizing POIs stems from geomet¬ 
ric constraints of the imaging process. The subregions of 
POIs usually will be projections (perspective projection 
in general) of a small patch of the 3-D face surface. For 
most POIs the corresponding 3-D patch will have little 
depth structure and therefore can be well approximated 
by a plane. Each POI can be subjected to separate trans¬ 
formations that account for the global head pose before 
the subregions are blended together. Such transforma¬ 
tions can be affine transformations in the image plane or 
even projective transformations of a plane in 3-D. Even 
if the 3-D surface is not exactly a plane, perspective dis¬ 
tortions and occlusion will be much less of a problem 
for smaller patches. The advantage is that it may not 
be necessary to have example subimages for all the va¬ 
riety of head poses. Rather, only a few samples may 
be sufficient. This is because intermediate views can be 
generated by appropriate transformations with good fi¬ 
delity. Thus, fewer examples are required to synthesize 
faces with the same variety of expressions. 

Probably the most compelling argument for decom¬ 
posing face images into subregions has to do with the 
nearest neighbors procedure. The most conspicuous fa¬ 
cial features to a human observer cover only a small frac¬ 
tion of the overall face image, e.g., we are very sensitive 
to even minor variations around the eyes. Any procedure
to assess similarity between face images that relies
on the whole face will be less sensitive to those local
variations. Take, for instance, normalized cross-correlation
of the full images. The correlation value is likely to be 
dominated by small illumination differences affecting the 
whole image rather than by local variations in the “sen¬ 
sitive” regions around the eyes. 

Notice that many of these arguments carry over to 
conventional model-based approaches. Using a physical 
3-D model may remedy problems of perspective and oc¬ 
clusion. But still, the appropriate texture (e.g., mouth, 
eyes open or closed) to be mapped is required. This 
texture is supplied as a 2-D grey-level or color image in 
a standard view. The task of generating this image is 
essentially the same as in the example-based approach 
advocated in this paper. 


4 System architecture 

In this section several algorithms that have been de¬ 
veloped so far within the novel framework for video- 
conferencing will be presented and discussed in more 
detail. Before doing so, we will sketch a possible archi¬ 
tecture for a very simple system. The intention is not to 
present a working system, but to outline its major com¬ 
ponents so that the developed algorithms can be seen in 
an appropriate context. 

A simple system architecture should comprise the fol¬ 
lowing components: 

1. compute normalized pose of new facial images rel¬ 
ative to a given example and estimate pose param¬ 
eters 

2. find nearest neighbors within an example database 
of subimages, e.g., regions around eyes, mouth, 
nose, etc. 

3. transmit model parameters, such as pose parame¬ 
ters and index numbers of the nearest neighbors 

4. reconstruct the face image on the receiver side by 
blending the regions of the subimages together 

5. transform the composed face image into the pose 
of the original image on the sender side 

A desirable extension to this simple scheme is to “inter¬ 
polate” each subimage between several suitable examples 
(see [10]). Also, adequate ways to update the example 
database automatically have to be devised. 
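
To make step 3 of the list above concrete, the following sketch packs one frame's model parameters into a byte string. The wire format is purely hypothetical and not taken from this paper: it assumes four 32-bit floats for the pose (translation in x and y, scale, in-plane angle) and one byte per nearest-neighbor index for three subregions.

```python
import struct

# Hypothetical wire format (an assumption, not the paper's):
# four little-endian 32-bit floats for the pose, then one
# unsigned byte per patch index (mouth, left eye, right eye).
FRAME_FORMAT = "<4f3B"

def encode_frame(pose, patch_indices):
    """Pack (tx, ty, scale, angle) and three patch indices into bytes."""
    return struct.pack(FRAME_FORMAT, *pose, *patch_indices)

def decode_frame(payload):
    """Inverse of encode_frame; returns (pose, patch_indices)."""
    values = struct.unpack(FRAME_FORMAT, payload)
    return values[:4], values[4:]

pose = (1.5, -2.0, 1.0, 0.25)   # values chosen to be exact in float32
indices = (17, 3, 4)
payload = encode_frame(pose, indices)
print(len(payload), decode_frame(payload))
```

Under these assumptions a frame is described by 19 bytes, independent of the image resolution.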

4.1 Automatic and robust pose estimation 

In what follows we describe a novel algorithm for auto¬ 
matic pose estimation and normalization of new face im¬ 
ages relative to a given example, i.e., a reference image. 
Some of the techniques used in this approach to pose es¬ 
timation are somewhat established in motion estimation 
and are reviewed briefly for the sake of completeness. 
We will emphasize, however, the original parts of our 
algorithm, based on the discussion of Section 3. 

The new algorithm can be sketched as follows. Using 
a restricted affine model for the transformation, four pa¬ 
rameters (translation, fronto-parallel rotation, and scale) 
are estimated in a correspondence scheme on the coarse 
resolution levels of a Laplacian pyramid only. Local con¬ 
fidence measures for correspondence are used as statisti¬ 
cal weights during least squares error (LSE) fit of these 
parameters. Off-plane rotation is handled by separate 
example images. The error between the unconstrained
measured displacement vector field and the affine motion
transformation at higher resolutions can be used to assess
the similarity between both images (see Section 4.3.2).

4.1.1 Analysis in spatial frequency bands 

All computation is performed in a hierarchical data
structure of the kind originally proposed by Tanimoto &
Pavlidis to speed up various image processing operations
[57]. A comprehensive overview of multiresolution image
processing and pyramid structures is given by Rosenfeld
[50].

For each image we compute a multiresolution pyramid,
where $I(x, y, t)$ denotes the discrete grey-level image.
The correspondence algorithm can be applied using
either Gaussian or Laplacian pyramids. We compute
these pyramids, adopting the algorithms proposed by
Burt [16] and Burt & Adelson [17]. A multiresolution
pyramid is a stack of $N$ levels of progressively smaller
versions of the original image. Let $l$ denote the level
within the pyramid and let $G^l$ be the reduced image at
the $l$-th level. The bottom of the pyramid is the original
image itself, i.e., $G^0 = I$. The image array $G^{l+1}$ at a
higher level is a lowpass filtered and subsampled copy of
its predecessor $G^l$. The images that form the Gaussian
pyramid are computed recursively by applying a reduce
operator to the previous level:

$$ G^{l+1} = \mathrm{reduce}\; G^l . $$

This procedure is iterated until the highest level $N$ is
reached. The reduce operator performs a convolution
with a smoothing kernel and a subsequent sampling at
every second image location in $G^l$, i.e., every second row
and column is skipped. Thus, the size in each direction
of the image array reduces roughly by a factor of two
between successive levels. As a consequence, the spatial
resolution and image size shrink to a quarter. In our
implementation we use a separable and symmetric 5 x 5
smoothing kernel. Its 1-D coefficients are derived from
the binomial distribution, e.g., $\frac{1}{16}(1, 4, 6, 4, 1)$.
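
The reduce operator and the resulting Gaussian pyramid can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: images are lists of rows, border samples are replicated (the text does not say how borders are treated), and all function names are hypothetical.

```python
# Binomial 1-D kernel; applied along rows and then columns it
# yields the separable 5 x 5 smoothing kernel.
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)

def blur_1d(values):
    """Convolve a 1-D sequence with KERNEL, replicating border samples."""
    n = len(values)
    return [
        sum(k * values[min(max(i + offset, 0), n - 1)]
            for offset, k in zip(range(-2, 3), KERNEL))
        for i in range(n)
    ]

def reduce_level(image):
    """Lowpass filter a 2-D image, then keep every second row and column."""
    rows = [blur_1d(row) for row in image]
    columns = [blur_1d(list(col)) for col in zip(*rows)]
    blurred = [list(row) for row in zip(*columns)]
    return [row[::2] for row in blurred[::2]]

def gaussian_pyramid(image, top_level):
    """Return [G^0, G^1, ..., G^N] with G^0 = image and N = top_level."""
    levels = [[list(row) for row in image]]
    for _ in range(top_level):
        levels.append(reduce_level(levels[-1]))
    return levels

image = [[float(x + y) for x in range(8)] for y in range(8)]
pyramid = gaussian_pyramid(image, 2)
print([(len(level), len(level[0])) for level in pyramid])
```

Because the kernel coefficients sum to one, a constant image is left unchanged by the filtering; only its size is halved per direction at each level.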

A Gaussian pyramid consists of an ordered set of lowpass
filtered versions of the original image. For the
previously discussed type of Gaussian pyramids, the filter
bandlimit reduces by an octave (factor of two) from level
to level. A Laplacian pyramid may be regarded as a stack
of bandpass filtered image “incarnations”. The name
arises from the fact that the Laplacian edge detector²
commonly used in image enhancement can be approximated
by the difference of Gaussians [40]. The optimal
ratio (the one that leads to the best approximation)
of the standard deviations of the inhibitory and excitatory
Gaussians is about 1.6. An efficient algorithm to compute
a pyramid of bandpass filtered images having a ratio of
$\sqrt{2}$ is the DOLP transform (difference of lowpass
transform) proposed by Crowley & Stern [23]. However, here
we construct a Laplacian pyramid from the difference
of images at adjacent levels of the Gaussian pyramid as
proposed by Burt & Adelson. Therefore, the ratio of the
standard deviations is 2.0. This results in a broader filter
bandwidth, which is favorable for achieving consistent
motion estimation across several frequency bands (pyramid
levels) since there is significantly more overlap between
adjacent bands. The center frequency of the bandpass
changes by an octave between levels.

² Formally this is the Laplacian operator applied to
a Gaussian convolution kernel of standard deviation $\sigma$:
$(\nabla^2 G_\sigma) * I(x, y)$.

The levels $L^l$ of the Laplacian pyramid can be generated
from the Gaussian pyramid by using an expand
operator:

$$ L^l = G^l - \mathrm{expand}\; G^{l+1} , $$

where we define $L^N = G^N$ for the highest level $N$. The
expand operator may be thought to do basically the reverse
of reduce. Its effect is to expand an image array
$G^{l+1}$ to an array having twice the linear extent by
interpolating values at intermediate locations at level $l$
between given sample points in $G^{l+1}$.

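The upper limit for the storage requirement of such
a pyramid is $\frac{4}{3} S$, where $S$ is the memory required for
the original image, since each level needs a quarter of the
memory of the level below
($1 + \frac{1}{4} + \frac{1}{16} + \dots < \frac{4}{3}$).
Moreover, the computational costs increase linearly with $S$.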
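
The relation between the Gaussian and Laplacian levels can be illustrated with a short sketch. To keep it compact it works on 1-D signals rather than images, replicates border samples, and implements expand by zero insertion followed by smoothing and a gain of two; these are assumptions made for the illustration, and the function names are hypothetical.

```python
KERNEL = (1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16)

def blur(signal):
    """Convolve with the binomial kernel, replicating border samples."""
    n = len(signal)
    return [
        sum(k * signal[min(max(i + o, 0), n - 1)]
            for o, k in zip(range(-2, 3), KERNEL))
        for i in range(n)
    ]

def reduce_level(signal):
    """Lowpass filter, then keep every second sample."""
    return blur(signal)[::2]

def expand_level(signal, size):
    """Insert zeros between samples, smooth, and restore the gain."""
    upsampled = [0.0] * size
    for i, value in enumerate(signal):
        if 2 * i < size:
            upsampled[2 * i] = value
    return [2.0 * v for v in blur(upsampled)]

def laplacian_pyramid(signal, top_level):
    """Return [L^0, ..., L^N] with L^l = G^l - expand G^(l+1), L^N = G^N."""
    gauss = [list(signal)]
    for _ in range(top_level):
        gauss.append(reduce_level(gauss[-1]))
    laplace = []
    for level in range(top_level):
        expanded = expand_level(gauss[level + 1], len(gauss[level]))
        laplace.append([g - e for g, e in zip(gauss[level], expanded)])
    laplace.append(gauss[-1])
    return laplace

def reconstruct(laplace):
    """Invert the pyramid: G^l = L^l + expand G^(l+1), from the top down."""
    current = laplace[-1]
    for band in reversed(laplace[:-1]):
        expanded = expand_level(current, len(band))
        current = [b + e for b, e in zip(band, expanded)]
    return current

signal = [float((3 * i) % 7) for i in range(16)]
bands = laplacian_pyramid(signal, 2)
print([len(band) for band in bands])
```

By construction the original signal is recovered by adding the expanded coarser level back to each band, whatever interpolation the expand step uses.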

4.1.2 Estimation of local transformation and 
confidence values 

Let us suppose that we have two similar face images of
the same person. We name the first image $E = E(\mathbf{x}, n)$,
indicating that it is one of the example images in the
database ($n$ is an index number), and a second new image
$I = I(\mathbf{x}, t)$ acquired by the video camera at time
$t$ (with $\mathbf{x} = (x, y)^T$ being the location in the image).
Our task is now to bring the new image I to the clos¬ 
est possible alignment with E. Moreover, we want to 
obtain a robust estimate for the transformation parame¬ 
ters despite the fact that both facial expressions may be 
significantly different (see Figure 3 for examples). 

Our novel algorithm that achieves this goal can be 
subdivided into two main steps. Firstly, we describe a 
differential technique for local alignment of small image 
patches in two images. Secondly, we present an algo¬ 
rithm that fits the parameters of a restricted affine model 
to describe the global transformation between the two 
faces due to different poses in both images. Even though 
the discussion here uses the terminology tailored to our 
video-conference system, the results and algorithms can 
be generalized to other problems. 

In our derivation we follow the lines of Lucas & 
Kanade who first proposed a differential algorithm for 
image registration and stereo [37, 38]. However, more 
recently related techniques have been presented in vari¬ 
ous flavors in the context of motion estimation or optical 
flow computation [26, 31, 43, 32, 34, 29, 55]. A comprehensive
survey and comparison of differential and other
optical flow techniques is given by Barron, Fleet &
Beauchemin [7]. Unfortunately, they do not consider
coarse-to-fine methods that are essential to extend the velocity
range of differential techniques.

We assume that at a sufficiently low resolution both 
images are locally similar and that a small patch in I 
can be approximated as being a shifted version of a
corresponding patch in $E$. That is:
$I(\mathbf{x}) = E(\mathbf{x} + \mathbf{d}(\mathbf{x}))$,
where $\mathbf{d}(\mathbf{x})$ is the local displacement vector that we want
to estimate. In the case of optical flow techniques $E$
and $I$ are two consecutive images taken from a motion
sequence and $\mathbf{d} = \mathbf{v} \cdot \Delta t$ depends on the instantaneous
local velocity $\mathbf{v}$ and the time interval $\Delta t$.

In order to align both image patches we have to search
for the displacement $\mathbf{d}$ that minimizes a distance
measure between the patches in $I$ and $E$. A typical measure
is the $L_2$-norm of the grey-levels over a certain
neighborhood $\Omega$ centered around the image point $\mathbf{x}$.
Moreover, we assume that $\mathbf{d}(\mathbf{x})$ varies only smoothly
and thus can be modeled to be constant over $\Omega$. This is
reasonable since we apply this procedure to bandlimited
images anyway. In addition we want to allow for a weighting
function $W(\mathbf{x}) > 0$, which gives us the freedom to
emphasize the central region of $\Omega$ over the periphery.
In order to find $\mathbf{d}$ we can formulate a least squares
problem. Thus, we want to minimize:

$$ e = \sum_{\mathbf{x} \in \Omega} \bigl\| W(\mathbf{x}) \bigl[ I(\mathbf{x}) - E(\mathbf{x} + \mathbf{d}(\mathbf{x})) \bigr] \bigr\|^2 . \tag{1} $$


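We approximate $E(\mathbf{x} + \mathbf{d}(\mathbf{x}))$ by a Taylor expansion
truncated after the linear term and differentiate the error
$e$ with respect to $\mathbf{d}$. The displacement that minimizes
(1) is the solution of the equation system

$$ D\,\mathbf{d} = \mathbf{c}, \quad \text{where} \quad
\mathbf{c} = \begin{pmatrix}
\sum W^2(\mathbf{x})\, I_x(\mathbf{x})\, \Delta I(\mathbf{x}) \\
\sum W^2(\mathbf{x})\, I_y(\mathbf{x})\, \Delta I(\mathbf{x})
\end{pmatrix} \tag{2} $$

and

$$ D = \begin{pmatrix}
\sum W^2(\mathbf{x})\, I_x^2(\mathbf{x}) & \sum W^2(\mathbf{x})\, I_x(\mathbf{x})\, I_y(\mathbf{x}) \\
\sum W^2(\mathbf{x})\, I_x(\mathbf{x})\, I_y(\mathbf{x}) & \sum W^2(\mathbf{x})\, I_y^2(\mathbf{x})
\end{pmatrix} . $$

Here we have introduced the abbreviations
$\Delta I(\mathbf{x}) = I(\mathbf{x}) - E(\mathbf{x})$ for the difference of the
intensity values and $I_x = \partial I / \partial x$ for the partial
derivative, and $I_y$ respectively. Indication of the explicit
dependence of $D$, $\mathbf{d}$, and $\mathbf{c}$ on $\mathbf{x}$ is omitted.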

In our implementation we use a five tap central difference
mask to approximate these spatial derivatives, e.g.,
the coefficients are $(-1, 8, 0, -8, 1)$. This is a reasonable
compromise between computational cost and goodness
of the approximation provided that the signal is
sufficiently bandlimited. The spatial neighborhood $\Omega$ is
a 5 x 5 square centered around the actual image point.
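
The local estimation step can be sketched as follows, assuming that the spatial derivatives, the intensity differences and the weights have already been sampled over the neighborhood. The function names, the determinant threshold and the 1/12 normalization of the derivative mask (the usual choice for unit pixel spacing, not stated in the text) are assumptions of this sketch.

```python
def derivative_5tap(samples, i):
    """Central difference at index i with the taps (-1, 8, 0, -8, 1).

    Written as an explicit sum over neighbors; the division by 12
    is an assumed normalization for unit pixel spacing.
    """
    return (samples[i - 2] - 8.0 * samples[i - 1]
            + 8.0 * samples[i + 1] - samples[i + 2]) / 12.0

def solve_local_displacement(ix, iy, delta_i, weights, det_threshold=1e-9):
    """Build D and c from samples over the neighborhood and solve D d = c.

    Returns (dx, dy), or None when det(D) does not exceed the
    (hypothetical) threshold, i.e. when D is nearly singular.
    """
    dxx = dxy = dyy = cx = cy = 0.0
    for gx, gy, di, w in zip(ix, iy, delta_i, weights):
        w2 = w * w
        dxx += w2 * gx * gx
        dxy += w2 * gx * gy
        dyy += w2 * gy * gy
        cx += w2 * gx * di
        cy += w2 * gy * di
    det = dxx * dyy - dxy * dxy
    if det <= det_threshold:
        return None
    # Closed-form inverse of the symmetric 2 x 2 matrix D.
    return ((dyy * cx - dxy * cy) / det, (dxx * cy - dxy * cx) / det)

# Synthetic samples that satisfy delta_i = ix * 0.5 + iy * (-0.25) exactly.
ix = [1.0, 0.0, 2.0, -1.0]
iy = [0.0, 1.0, 1.0, 3.0]
delta_i = [0.5, -0.25, 0.75, -1.25]
print(solve_local_displacement(ix, iy, delta_i, [1.0] * 4))
```

When all gradients in the neighborhood are parallel, the determinant vanishes and the function returns None instead of an unstable estimate.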

The important fact is that, in addition to the estimated
local displacement $\mathbf{d}(\mathbf{x})$, we can obtain an associated
reliable measure of its correctness. This confidence
measure $k(\mathbf{x})$, as it will be called in the sequel, is used in
the second step to weight each displacement vector when
we fit the parameters of a low order polynomial model
for the global transformation.

In the rest of this section two questions will be discussed:
a) what are the solutions of (2), and b) what is
the optimal confidence measure for our purposes. Some
of the following items have been addressed in individ¬ 
ual papers on optical flow techniques, but we think it is 
worthwhile to repeat them in the context of our specific 
application. 

Note that the matrix $D$ has two important properties
that will be exploited in the sequel. Firstly, it is symmetric,
i.e., $D = D^T$. Therefore, $D$ has two real eigenvalues
$\lambda_1, \lambda_2 \in \mathbb{R}$ and the corresponding eigenvectors are
orthogonal if the eigenvalues are distinct, i.e., $\lambda_1 \neq \lambda_2$.
Secondly, $D$ is positive semi-definite (the quadratic form
$\mathbf{x}^T D \mathbf{x} \geq 0 \;\forall\, \mathbf{x} \in \mathbb{R}^2$ and $\mathbf{x} \neq 0$) as can be verified by
Sylvester’s criterion [35]. Consequently the eigenvalues
are nonnegative ($\lambda_1, \lambda_2 \geq 0$). The eigenvalues are computed
as the roots of the characteristic quadratic polynomial
in our implementation. Let $\lambda_{min} = \min(\lambda_1, \lambda_2)$
be the smaller eigenvalue, and $\lambda_{max} = \max(\lambda_1, \lambda_2)$ be
the larger one.

In order to solve (2) for the displacement $\mathbf{d}$ three
different cases have to be distinguished:

1. If $\det(D) \neq 0$ the inverse $D^{-1}$ exists and (2) can
be solved for the 2-D displacement $\mathbf{d}$. However, in
practice the determinant has to exceed a certain
threshold to ensure stable results: $\det(D) > \tau_{det}$.
If the matrix $D$ is singular, i.e.,
$\det(D) = \lambda_{min} \cdot \lambda_{max} < \tau_{det}$, we must distinguish
the following two cases.

2. If $\lambda_{max} > 0 \wedge \lambda_{min} = 0$, i.e.,
$\lambda_{max} \geq \tau_{max} \wedge \lambda_{min} < \tau_{min}$ in our implementation, we
have

$$ \det(D) = \sum W^2(\mathbf{x})\, I_x^2(\mathbf{x}) \sum W^2(\mathbf{x})\, I_y^2(\mathbf{x})
- \Bigl( \sum W^2(\mathbf{x})\, I_x(\mathbf{x})\, I_y(\mathbf{x}) \Bigr)^2 = 0 . $$

This is satisfied if $I_x(\mathbf{x}) = \mathrm{const} \cdot I_y(\mathbf{x}) \;\forall\, \mathbf{x} \in \Omega$.
The interpretation is that the image intensities
within the region $\Omega$ lie on a plane and consequently
all spatial gradients have the same direction. This
situation represents the well-known aperture problem
and we can only determine the normal component
of the displacement:

$$ \mathbf{d}_n(\mathbf{x}) = \frac{\Delta I(\mathbf{x})}{\|\nabla I(\mathbf{x})\|} \,
\frac{\nabla I(\mathbf{x})}{\|\nabla I(\mathbf{x})\|} . $$

3. If $\lambda_{max} = 0$ all the entries in $D$ are zero. The
situation $\lambda_{max} < \tau_{max}$ may occur in practice if the
image has insufficient texture within the region
$\Omega$. Consequently, the spatial gradients nearly
vanish and we cannot determine any component of
displacement.
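
The case distinction above can be written as a small decision function. The sketch below takes the two eigenvalues as given; the threshold values in the example call are arbitrary placeholders, the function names are hypothetical, and treating every remaining combination as "no estimate" is a conservative choice of this sketch rather than something stated in the text.

```python
def classify_patch(lam_min, lam_max, tau_det, tau_max, tau_min):
    """Decide which components of the displacement can be recovered."""
    if lam_min * lam_max > tau_det:
        return "full"      # case 1: D is safely invertible
    if lam_max >= tau_max and lam_min < tau_min:
        return "normal"    # case 2: aperture problem, normal component only
    return "none"          # case 3: insufficient texture (or inconsistent thresholds)

def normal_displacement(delta_i, grad_x, grad_y):
    """Normal component d_n = (delta_i / |grad I|) * (grad I / |grad I|)."""
    norm_sq = grad_x * grad_x + grad_y * grad_y
    return (delta_i * grad_x / norm_sq, delta_i * grad_y / norm_sq)

# Placeholder thresholds, chosen only for the demonstration.
print(classify_patch(5.8, 11.2, tau_det=1.0, tau_max=1.0, tau_min=0.01))
print(classify_patch(0.0, 25.0, tau_det=1.0, tau_max=1.0, tau_min=0.01))
print(normal_displacement(2.0, 3.0, 4.0))
```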

The confidence measure $k(\mathbf{x})$ associated with the
displacement vector $\mathbf{d}(\mathbf{x})$ can be derived from the entries
in the matrix $D(\mathbf{x})$. Several ways to do this have been
proposed in the literature:

1. Simoncelli, Adelson & Heeger presented a Bayesian
framework for optical flow computation [55]. They
emphasized the relevance of the trace of the spatial
derivative matrix for the probability distributions
of velocity vectors. Here, we have for the trace of
$D$:

$$ \mathrm{tr}(D) = \lambda_1 + \lambda_2
= \sum W^2(\mathbf{x})\, I_x^2(\mathbf{x}) + \sum W^2(\mathbf{x})\, I_y^2(\mathbf{x}) . $$

2. It is obvious from the previous discussion that the
larger $\det(D)$ is, the more stable is the solution of
the linear system (2).

3. Uras et al. [63] proposed the smallest condition
number³ $\kappa(H)$ of the matrix $H$ as an accuracy
criterion. The matrix $H$ is the Hessian of image
intensity $I(\mathbf{x}, t)$. It arises from an optical flow
technique using second order constraints to recover the
2-D velocity locally (see also [29]).

Based on this approach Toelg developed a refined
and robust algorithm that is used in an active vision
system [60, 59]. However, in extensive experiments
the magnitude of the determinant $\det(H)$,
i.e., the spatial Gaussian curvature in the intensity
image, turned out to be a better confidence measure.
This finding is in accordance with the more
recent results discussed in [7].

Here, we may use the condition number of the matrix
$D$, $\kappa(D) = \lambda_{max} / \lambda_{min}$, as a measurement for
reliability.

4. Here, we advocate the magnitude of the smallest
eigenvalue $\lambda_{min} = \min(\lambda_1, \lambda_2)$ as an appropriate
confidence measure. This is along the lines of the
practical results reported in [7].

³ The condition number is defined as the ratio between
the largest and the smallest absolute eigenvalue of a matrix
(cf. [49]). A matrix is ill-conditioned if its condition number
$\kappa$ is too large, and it is singular if $\kappa$ is infinite.

A brief justification for this choice will be given. We
are only interested in the first case for solving (2) where
the full 2-D displacement vector can be recovered reliably.
Using $\lambda_{min}$ as a confidence measure gives a lower
bound for the determinant of $D$, since $\det(D) \geq \lambda_{min}^2$.
Of course this also gives a lower bound for the trace of $D$,
since $\mathrm{tr}(D) \geq 2\lambda_{min}$. Moreover, a larger $\lambda_{min}$ gives rise
to a condition number $\kappa(D)$ closer to unity in the
implementation. This is because $\lambda_{max}$ is bounded from above
due to spatial lowpass filtering and due to the limited
range of intensity values. It is interesting to note that
for any 2 x 2 matrix, the characteristic equation can be
written as: $\lambda^2 - \lambda\,\mathrm{tr}(D) + \det(D) = 0$. We conclude that
taking the magnitude of the smallest eigenvalue $\lambda_{min}$ as
a confidence measure implies all the other discussed
criteria.
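
The relations between the discussed criteria can be verified numerically. The following sketch computes the eigenvalues of a symmetric 2 x 2 matrix as the roots of the characteristic polynomial and reports the trace, determinant and condition number alongside the smallest eigenvalue; the function names and the example matrix are illustrative assumptions.

```python
import math

def eigenvalues_2x2_symmetric(dxx, dxy, dyy):
    """Roots of lambda^2 - lambda * tr(D) + det(D) = 0, smallest first."""
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    # For a symmetric matrix the discriminant is non-negative;
    # the clamp only guards against rounding errors.
    root = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    return trace / 2.0 - root, trace / 2.0 + root

def confidence_measures(dxx, dxy, dyy):
    """Collect the candidate confidence measures for one matrix D."""
    lam_min, lam_max = eigenvalues_2x2_symmetric(dxx, dxy, dyy)
    return {
        "lambda_min": lam_min,
        "trace": dxx + dyy,
        "det": dxx * dyy - dxy * dxy,
        "condition": lam_max / lam_min if lam_min > 0.0 else math.inf,
    }

measures = confidence_measures(6.0, -1.0, 11.0)
print(measures)
```

For any such matrix the determinant is at least the square of the smallest eigenvalue and the trace is at least twice its value, so thresholding the smallest eigenvalue also bounds the other two quantities from below.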


4.1.3 Fitting global parameters for the pose 
model 


We will now derive the second step of the pose estimation
algorithm. The local displacement vectors $\mathbf{d}(\mathbf{x})$ and
their associated confidence measures $k(\mathbf{x})$ are used to
estimate parameters for the global pose transformation
model.

We assume an affine model for the displacement vector
field. This is a reasonable assumption, since we
estimate the pose using low spatial frequency images of
the face only. We want to discard facial expressions that 
have very little effect on these low resolution images as 
discussed in Section 3.3. We will outline only the basic 
idea in this section. The reader is referred to Appendix 
A for mathematical details. 

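The estimated affine displacement field $\hat{\mathbf{d}}(\mathbf{x}_i)$ is
determined at any image location $\mathbf{x}_i = (x_i, y_i)^T$ by six model
parameters in the general case:

$$ \hat{\mathbf{d}}(\mathbf{x}_i) = A\,\mathbf{x}_i + \mathbf{t} \tag{3} $$

with

$$ A = \begin{pmatrix} b_x & c_x \\ b_y & c_y \end{pmatrix}
\quad \text{and} \quad
\mathbf{t} = \begin{pmatrix} a_x \\ a_y \end{pmatrix} . \tag{4} $$

Suppose we have $n$ image points $\mathbf{x}_i$ ($i = 1, \dots, n$) with a
measured displacement
$\mathbf{d}(\mathbf{x}_i) = (d_x(x_i, y_i), d_y(x_i, y_i))^T$
and associated confidence measures $k(\mathbf{x}_i)$.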

In general (3) cannot be satisfied exactly for all points.
Instead, we want to find the parameter vector
$\mathbf{p} = (a_x, b_x, c_x, a_y, b_y, c_y)^T$ (cf. equation (14) on page 20) that
minimizes the error between the measured displacement
field $\mathbf{d}(\mathbf{x}_i)$ and the fitted affine displacement field
$\hat{\mathbf{d}}(\mathbf{x}_i)$. We assume Gaussian statistics of the process
and use the $L_2$-norm as a distance measure. So, we want to
minimize the sum of the squared differences (SSD) over all
image points:

$$ e = \sum_i w_i \, \bigl\| \mathbf{d}(\mathbf{x}_i) - \hat{\mathbf{d}}(\mathbf{x}_i) \bigr\|^2 , \tag{5} $$

where we assume a weight $w_i$ for each data point $\mathbf{x}_i$.
These weights are computed as the values of a monotonic
function $w_i = s(k(\mathbf{x}_i))$ of the associated confidence
measures. The function $s(\cdot)$ must be nonlinear to decrease
the range, which turned out to be too large. We utilize
a sigmoid-like characteristic. In our experiments the
choice of $w_i = \sqrt{k(\mathbf{x}_i)}$ worked very well.

The solution for this weighted least squares problem 
is found by the weighted pseudo-inverse as given in (18). 
In the general case this leads to six equations for six 
parameters as given in (19) and (20). 

The affine model cannot account for perspective ef¬ 
fects and occlusions such as occur during off-plane ro¬ 
tation of the face. This kind of head movement must 
be handled by separate example images. Therefore, we 
do not want to allow for any component of shear and re¬ 
duce the degree of freedom in the affine model. To admit 
only translation, scale and in-plane rotation in the model 
we impose additional constraints on the transformation 
matrix (see Appendix A.2 for details): 

$$ A = S R - I \tag{6} $$

with

$$ S = \begin{pmatrix} s & 0 \\ 0 & s \end{pmatrix}, \quad
R = \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix} . \tag{7} $$

$I$ is the identity matrix, $s$ denotes the isotropic scale
factor, and $\alpha$ is the angle of rotation in the image plane.
These constraints result in a coupling between the equations
for the general case. The system can be reduced to
four equations (see (26) and (27)) in the four model
parameters $a_x, a_y, s, \alpha$ (see (25) on page 21 for definition).

Further simplification can be achieved by introducing 
a new barycentric coordinate system (see Appendix A.4). 
The origin of this new reference frame coincides with the 
weighted center of gravity of the image. Expressed in this
new reference frame the equation systems take a much 
simpler form. 

For the sake of generality, the case of a full affine 
transformation is derived first. The new equations for 
the general case with six free parameters are given in 
(37) and (38) on page 22. They can be directly solved 
for the translation parameters $a'_x, a'_y$. For the remaining
parameters only two decoupled 2 x 2 systems have to be
solved. 

However, even more significant is the advantage of the 
new reference system in the constrained case where we 
do not allow for any component of shear (see Appendix 
A.4.2). We obtain a 4 x 4 equation system. As is
obvious from (42), the corresponding matrix is diagonal.
Hence, the system can be solved directly and the solution 
can be written in closed form (see (44) - (46)). 
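
The constrained fit can be illustrated with a compact weighted least squares sketch in centroid (barycentric) coordinates. It is not a transcription of the appendix equations: the parametrization $a = s\cos\alpha$, $b = s\sin\alpha$, the function names and the test data are assumptions made for this illustration. The weights follow the square-root rule mentioned in the text.

```python
import math

def weights_from_confidence(confidences):
    """Compress the range of the confidence values: w_i = sqrt(k_i)."""
    return [math.sqrt(k) for k in confidences]

def fit_constrained_affine(points, displacements, weights):
    """Weighted fit of translation, isotropic scale and in-plane rotation.

    Model: x + d(x) = s R x + t with R = [[cos a, sin a], [-sin a, cos a]].
    Returns (t_x, t_y, s, alpha).
    """
    wsum = sum(weights)
    targets = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(points, displacements)]
    cx = sum(w * p[0] for w, p in zip(weights, points)) / wsum
    cy = sum(w * p[1] for w, p in zip(weights, points)) / wsum
    qx = sum(w * q[0] for w, q in zip(weights, targets)) / wsum
    qy = sum(w * q[1] for w, q in zip(weights, targets)) / wsum
    num_a = num_b = den = 0.0
    for w, (x, y), (u, v) in zip(weights, points, targets):
        ux, uy = x - cx, y - cy          # source point relative to centroid
        vx, vy = u - qx, v - qy          # target point relative to centroid
        num_a += w * (ux * vx + uy * vy)
        num_b += w * (uy * vx - ux * vy)
        den += w * (ux * ux + uy * uy)
    a, b = num_a / den, num_b / den      # a = s cos(alpha), b = s sin(alpha)
    t_x = qx - (a * cx + b * cy)
    t_y = qy - (-b * cx + a * cy)
    return t_x, t_y, math.hypot(a, b), math.atan2(b, a)

# Noise-free synthetic displacement field for s = 1.1, alpha = 0.1, t = (2, -1).
s_true, alpha_true, t_true = 1.1, 0.1, (2.0, -1.0)
a0, b0 = s_true * math.cos(alpha_true), s_true * math.sin(alpha_true)
points = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (10.0, 10.0), (5.0, 2.0)]
displacements = [(a0 * x + b0 * y + t_true[0] - x, -b0 * x + a0 * y + t_true[1] - y)
                 for x, y in points]
weights = weights_from_confidence([1.0, 4.0, 9.0, 16.0, 25.0])
print(fit_constrained_affine(points, displacements, weights))
```

In centroid coordinates the translation decouples from scale and rotation, so each parameter follows from a single ratio of weighted sums, in line with the diagonal system mentioned above.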

The size of the face images varies because of changes in
the distance between the camera and the person’s head.
We model these variations by a scale factor $s$ in the
image plane. Mathematically this is only correct for a size
variation due to a change in the focal length of the lens
while the distance is retained (no change of perspective).
Since we apply this algorithm only to low spatial fre¬ 
quency images, the influence of perspective distortions 
and self-occlusion is mostly negligible. Our experiments 
demonstrated that this approximation is sufficient for a 
reasonable distance range (about ± 20 % around the ideal 
position) and for typical ratios of the focal length and 
the viewing distance. 

4.1.4 Hierarchical control structure 

The pose estimation algorithm is embedded in a
coarse-to-fine control structure. Computation starts at the
highest level $N$ within the Laplacian pyramid (cf. Section
4.1.1). At this level with lowest resolution all model
parameters are initialized. On each pyramid level the 
following steps are performed in sequence: 

1. The local displacement vector field $\mathbf{d}(\mathbf{x})$ and the
associated confidence field $k(\mathbf{x})$ are computed in
the way described in Section 4.1.2.

2. The global transformation parameters of a constrained
affine model are estimated by the weighted
least squares fit derived in Section 4.1.3 and the
appendix.

3. The residual affine parameters estimated at level $l$
and the parameters propagated from a higher level
$l + 1$ are combined. As is shown in Appendix A.8
the refined affine parameters at level $l$ are simply
obtained as the sum of the propagated parameters
and the residual parameters.

4. The estimated model parameters are then propagated
from a higher pyramid level $l + 1$ to the next
lower level $l$. The relation between the parameters
is derived in Appendix A.7. The result is:

$$ A^l = A^{l+1} \quad \text{and} \quad \mathbf{t}^l = 2 \cdot \mathbf{t}^{l+1} ; \tag{8} $$

thus, only the translation vector $\mathbf{t}^{l+1}$ has to be
multiplied by two. The coefficients of the matrix $A^{l+1}$
are retained.

5. At the current level $l$ the original image $I_{orig}$ is
warped according to the propagated model parameters.
The warp operation remaps the intensity values
according to:
$I_{warped}(\mathbf{x}) = I_{orig}(\mathbf{x} + \mathbf{d}(\mathbf{x}))$.
Note that the addresses $\mathbf{x} + \mathbf{d}(\mathbf{x})$ in $I_{orig}$ will in
general not coincide with integer coordinates of the
image array. The intensity values in $I_{warped}$ have
to be interpolated over a small region centered at
this address. Only bilinear interpolation is used for
the bandlimited pyramid images. The warped image
$I_{warped}$ is in closer alignment with the example
image $E$.

6. Steps 1.-5. are repeated on successively lower
pyramid levels, i.e., on successively higher frequency
bands of the original images. The process
is terminated at a given pyramid level. In our
implementation we terminated the refinement at level
2 or 3, i.e., at 1/16 or 1/64 of the original image
size.

7. After the refinement is terminated, the estimated
affine parameters are propagated (see Appendix
A.7) to the bottom level of the pyramid which has
the same size and resolution as the original image.

8. To obtain a pose compensated face, the new image
$I$ is warped according to the pose parameters.
However, here we use bicubic interpolation (e.g.,
Lagrangian interpolation [11, 5] or bicubic splines)
for the warping, since we do not want the high
frequency components of the original image to be
suppressed and to reduce aliasing effects.
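
Steps 3 to 5 of the control structure can be sketched as follows. The sketch assumes images stored as lists of rows, pixel coordinates with the origin at the first sample, and clamping of addresses that fall outside the image; none of these conventions is specified in the text, and all function names are hypothetical.

```python
import math

def propagate_to_finer_level(matrix_a, translation):
    """Equation (8): keep A, double t when moving from level l+1 to l."""
    return matrix_a, (2.0 * translation[0], 2.0 * translation[1])

def combine_parameters(propagated, residual):
    """Step 3: refined parameters = propagated + residual parameters."""
    (a_prop, t_prop), (a_res, t_res) = propagated, residual
    a_sum = tuple(tuple(p + r for p, r in zip(row_p, row_r))
                  for row_p, row_r in zip(a_prop, a_res))
    return a_sum, (t_prop[0] + t_res[0], t_prop[1] + t_res[1])

def bilinear_sample(image, x, y):
    """Interpolate image at a non-integer address, clamping to the border."""
    height, width = len(image), len(image[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = math.floor(x), math.floor(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    top = (1.0 - fx) * image[y0][x0] + fx * image[y0][x1]
    bottom = (1.0 - fx) * image[y1][x0] + fx * image[y1][x1]
    return (1.0 - fy) * top + fy * bottom

def warp_affine(image, matrix_a, translation):
    """I_warped(x) = I_orig(x + d(x)) with d(x) = A x + t."""
    (a11, a12), (a21, a22) = matrix_a
    tx, ty = translation
    warped = []
    for y in range(len(image)):
        row = []
        for x in range(len(image[0])):
            dx = a11 * x + a12 * y + tx
            dy = a21 * x + a22 * y + ty
            row.append(bilinear_sample(image, x + dx, y + dy))
        warped.append(row)
    return warped

ramp = [[float(x) for x in range(5)] for _ in range(4)]
no_deformation = ((0.0, 0.0), (0.0, 0.0))
shifted = warp_affine(ramp, no_deformation, (0.5, 0.0))
print(shifted[1])
```

For the final warp at full resolution the text uses bicubic rather than bilinear interpolation; only the sampling function would need to be exchanged.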

4.1.5 Experimental results 

Using real image data, we will now present some typi¬ 
cal experimental results to demonstrate the robustness 
and versatility of the algorithm for pose estimation and 
compensation. 

The two images in Figure 5 might typically occur dur¬ 
ing a video-conferencing session. Here the left image is 
a reference image that would be stored in the example 
database of standardized face images. The right image 
represents a new video frame acquired during the session. 
Its pose parameters have to be estimated with respect to 
the reference image and the new image has to be stan¬ 
dardized in pose to facilitate further processing. Figure 
6 depicts the resulting images after automatic pose com¬ 
pensation. The estimation is only continued up to a final 
pyramid level. Subsequently, the parameters are propa¬ 
gated to the resolution of the original image. After esti¬ 
mation on level 3 or 2 no significant change is apparent 
when estimating at higher resolution. Table 2 gives the 
values of the corresponding pose parameters. The dia¬ 
grams in Figure 7 show different graphic representations 
of the data in Table 2. It is obvious that the parame¬ 
ters estimated at higher levels converge to their bottom 
value at the highest resolution. Furthermore, the largest 
adjustments already happen at the higher levels having 
low resolutions. 

Figure 8 shows another more “difficult” pair of im¬ 
ages. In addition to the different poses of the faces, there 
is a significant difference in the facial expressions, e.g., 
the smile in the reference image with the mouth opened, 
the teeth visible, shifted corners of the mouth, and dif¬ 
ferent direction of gaze as compared to the more neutral 
right image. In Figure 9 the resulting images after esti¬ 
mation and pose compensation up to the final resolution 
level are displayed. By visual inspection no significant 
change happens after level 2. The corresponding param¬ 
eter values in Table 4 as well as the diagrams in Figure 
10 confirm this finding. 

However, it is notable that the curves are not mono¬ 
tonic and not as smooth as in Figure 7. This suggests 
some caution and a closer examination. Figure 11 shows 
the pose parameters obtained at each resolution level. 
The four diagrams depict the results after an increasing 
number of iterations (1, 2, 3, 5, respectively) at each level 
before proceeding to the estimation at the next higher 
resolution level. Table 5 gives the corresponding parameter
values. The most dominant change occurs between
one and two iterations per level. This observation indicates
that one pass per level might not be sufficient
to achieve the best possible alignment between both im¬ 
ages, especially when dealing with more “difficult” im¬ 
ages as in Figure 8. For visual assessment Figure 12 
shows the pose compensated images for different num¬ 
bers of iterations. Although the numerical parameter 
values for one and a larger number of iterations differ — 
especially for vertical translation and in-plane rotation 
— the visual appearance is quite similar. Instructive is 
a comparison with Table 3 which represents the data for 
the more similar (in facial expression, not in pose) image 
pair in Figure 5. Here, the parameter variation for an 
increasing number of iterations per level is much less. 

To be on the safe side, two or three iterations per 
level will improve the estimation and compensation. Al¬ 
though the difference might not be noticeable by visual 
inspection, it will be beneficial for further processing 
steps. 

To demonstrate the range of variation the algorithm 
can cope with, Figure 15 gives images that have been 
aligned with various reference images. Only two exam¬ 
ples with strong facial expression for the reference images 
are reproduced here. The reference images used in Fig¬ 
ure 16 and 17 are similar to the faces in the lower row of 
Figure 3, which are likely to overtax most feature based 
algorithms. However, the results are quite convincing, 
despite the fact that only one pass (number of itera¬ 
tions i = 1) has been performed and the final estimation 
level is $l = 2$. The range in pose is mainly limited by
border effects of the filtering within the Laplacian pyra¬ 
mid. The range would be extended provided that the 
face is smaller with respect to the image size. Notice 
that the pose estimation algorithm is symmetrical with 
respect to both images, i.e., it does not make a difference 
whether the new image or the reference image exhibits a 
strong facial expression. Moreover, the algorithm works 
well with strong expressions and deformations in both 
images. More examples with a variety of different face
images and also different backgrounds are presented in
[58].

4.2 Discussion of pose estimation 

The pose estimation and normalization algorithm described
in this section can be located at an intermediate
level of complexity among models describing global
motions of a face in images. The idea is to make some
general assumptions about the object that give rise to
a parameterized model. This model will be valid, i.e., a
good enough approximation for our purpose, only for a
limited range of poses and transformations. Some
linear models in increasing order of complexity (f, n),
i.e., number of free parameters f and minimum number
of corresponding image points n required, are:

1. pure translation in the image plane (2,1)

2. translation and rotation in the image plane (3,2)

3. constrained affine transformation (4,2) - the model
used in the algorithm

4. full affine transformation in the image plane (6,3)
- correct for motion of a plane under orthographic
projection

5. projective transformation of a plane (8,4) - correct
for motion of a plane under perspective projection

Rotation in the image plane fully compensates for
rotation of the head in space about an axis parallel to
the optical axis of the camera (in-plane rotation). Although
not exactly correct in the mathematical sense,
in practice translation in the image plane compensates
for translation of the head in space parallel to the image
plane. In the usual situation for video-conferencing,
however, a person will fixate the video display and hence
will keep the direction of gaze directed toward the nearby
camera. Thus, shifting the head in space will most
likely be accompanied by a compensatory rotation of
the head. The change of the image plane scale factor
can, to a reasonable approximation, take care of
distance variations between head and camera; the requirements
of weak perspective (see footnote 4) are sufficiently
well satisfied by standard imaging geometry. However, numerous
experiments showed that more complex transformations
do not yield natural appearing face images and
are very difficult, or even impossible, to handle in subsequent
processing steps, e.g., finding nearest neighbors. For instance,
allowing full affine transformation, the additional
components of shear can lead to severe distortions of a
face. Similarly, for points distant from the plane defining
a projective transformation severe distortions occur.
Moreover, occlusion effects become very obvious even for
small angles of off-plane rotation of a head (e.g., rotation
around the neck). For these reasons off-plane rotations
are better treated by separate example images. Taking
these observations into account, the constrained affine
transformation in the image plane is a good compromise
between the achieved reduction in the necessary number
of example images and the image fidelity.
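
As an illustration of the (4,2) entry in the list above, the following minimal sketch applies a constrained affine transformation (translation, in-plane rotation, isotropic scale) and recovers its four parameters from two point correspondences. The function names and the parameterization `(tx, ty, theta, s)` are assumptions made for this example; the parameterization actually used in Section 4.1 and Appendix A may differ.

```python
import numpy as np

def constrained_affine(points, tx, ty, theta, s):
    """Map (N, 2) image points by x' = s * R(theta) @ x + t (illustrative names)."""
    c, si = np.cos(theta), np.sin(theta)
    R = np.array([[c, -si], [si, c]])
    return s * points @ R.T + np.array([tx, ty])

def fit_from_two_points(p, q):
    """Recover (tx, ty, theta, s) from two correspondences p[i] -> q[i]."""
    dp, dq = p[1] - p[0], q[1] - q[0]
    # Scale is the ratio of segment lengths, rotation the difference of angles.
    s = np.linalg.norm(dq) / np.linalg.norm(dp)
    theta = np.arctan2(dq[1], dq[0]) - np.arctan2(dp[1], dp[0])
    c, si = np.cos(theta), np.sin(theta)
    R = np.array([[c, -si], [si, c]])
    t = q[0] - s * R @ p[0]
    return t[0], t[1], theta, s
```

Two corresponding points give four scalar equations, which matches the four free parameters; in practice many more measurements would be used in a least squares fit to gain robustness.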

A further step is to use more prior structural information
(than the approximation by a plane) about
faces/heads in order to increase the range of transformations
resulting in natural looking images. Recently,
an algorithm has been presented that applies the
model of a general quadric surface to map two images of
faces taken under perspective projection onto each other
[53, 54]. Using a projective framework, a constructive
proof is given that in general nine corresponding reference
points and the epipoles in two views of an approximately
quadric surface determine the correspondences
for all other image points of that surface. Encouraging
results have been achieved for real face images. This algorithm,
however, still requires manual selection of the
reference points. Also, its robustness does not meet the
high standards of the simpler algorithm presented in this
paper. On the other hand, it has been demonstrated
that the quadric surface model can successfully deal with
small amounts of off-plane rotation of the head as long
as occlusion effects are not too severe.

Although the images used here have a fairly homogeneous
background (some structure from the fabric and
due to illumination), the pose normalization algorithm
would still work in the presence of a moderate amount
of textured background surrounding the face. More robustness
could be achieved by a preceding segmentation
step. Suitable techniques for segmentation of the face
from the static background are readily available, such as
color segmentation [2, 4] and integral projection of directional
information [15]. These techniques perform grey-level
based static segmentation of single images, whereas
motion segmentation in front of the static background
exploits relative motion due to the unavoidable small
jitter or movements of a person’s head (a robust motion
segmentation algorithm is described in [60, 58], for
instance). Combinations of these techniques should be
considered in order to achieve greater generality.

4 For weak perspective, the depth of objects along the line
of sight is small compared with the viewing distance.


4.3 Finding the nearest neighbors 

In order to find the nearest neighbor(s) in an example
database for the subimages extracted from incoming face
images, several approaches are conceivable. The important
issue here is the similarity measure used for this
assessment. We will describe only the two ways to find
nearest neighbors that we have implemented.


4.3.1 Template matching in a multiresolution 
hierarchy 

One way to assess the similarity between images or subimages
is based on template matching by means of the normalized
linear cross-correlation coefficient:

$$C_N = \frac{\langle (E - \langle E \rangle)(I - \langle I \rangle) \rangle}{\sigma(E)\,\sigma(I)} \qquad (9)$$

$$C_N = \frac{\langle E I \rangle - \langle E \rangle \langle I \rangle}{\sigma(E)\,\sigma(I)} \qquad (10)$$

where E is an example image and I is a new image that
is already normalized in pose. ⟨.⟩ denotes the average
operator and σ(.) the standard deviation over the image
intensity values. The value range is C_N ∈ [-1.0, 1.0].
If E and I are identical we have complete positive correlation
C_N = 1.0. If C_N ≈ 0.0, then the images are
uncorrelated. The use of this standard technique is suggested
by the good results reported for face recognition
by other researchers (cf. [6, 13, 28]).
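
The coefficient in the form of Equation (10) can be sketched in a few lines. This is a minimal illustration using NumPy, not the implementation used for the experiments; the function name and the treatment of constant images (returned as uncorrelated) are assumptions.

```python
import numpy as np

def normalized_cross_correlation(E, I):
    """C_N = (<E I> - <E><I>) / (sigma(E) sigma(I)) for two equally sized images."""
    E = np.asarray(E, dtype=float)
    I = np.asarray(I, dtype=float)
    covariance = np.mean(E * I) - E.mean() * I.mean()
    denom = E.std() * I.std()          # population standard deviations
    if denom == 0.0:
        return 0.0                     # assumption: a constant image counts as uncorrelated
    return float(covariance / denom)
```

Because mean and standard deviation are divided out, the coefficient is invariant to a change in brightness offset and to a positive contrast gain of either image.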

Our implementation performs normalized linear cross-correlation
within a multiresolution hierarchy. Correlation
is computed between corresponding levels of Laplacian
pyramids for the new images and the examples for
the subregion. Computation starts at a high pyramid
level at low resolution. Cross-correlation is performed
for a small window of horizontal and vertical shifts. For
each example image the optimal correlation value within
the shift window is chosen to achieve more robustness
against small distortions. The location of this optimal
correlation is propagated to the next lower pyramid level
and defines the center of the shift window at the next
higher resolution. The size of the shift window may either
be constant for all pyramid levels or may increase
with the spatial resolution.
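
The coarse-to-fine shift search described above can be sketched as follows. To keep the example short, a crude pyramid built by 2x2 averaging stands in for the Laplacian pyramid, the shift window has a constant radius, and image sides are assumed to be divisible by two at every level; all function names are illustrative.

```python
import numpy as np

def block_pyramid(img, levels):
    """Crude stand-in pyramid by 2x2 averaging, ordered coarse to fine."""
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(levels - 1):
        h, w = pyr[-1].shape
        pyr.append(pyr[-1].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3)))
    return pyr[::-1]

def ncc(a, b):
    """Normalized cross-correlation; -1.0 for empty or constant regions."""
    if a.size == 0:
        return -1.0
    denom = a.std() * b.std()
    if denom == 0.0:
        return -1.0
    return float((np.mean(a * b) - a.mean() * b.mean()) / denom)

def shifted_overlap(E, I, dy, dx):
    """Return the regions where E[y, x] and I[y + dy, x + dx] both exist."""
    h, w = E.shape
    y0 = max(0, -dy)
    y1 = max(y0, min(h, h - dy))
    x0 = max(0, -dx)
    x1 = max(x0, min(w, w - dx))
    return E[y0:y1, x0:x1], I[y0 + dy:y1 + dy, x0 + dx:x1 + dx]

def hierarchical_match(pyr_E, pyr_I, radius=1):
    """Return (best correlation, dy, dx) at the finest level."""
    cy = cx = 0
    result = (-1.0, 0, 0)
    for E, I in zip(pyr_E, pyr_I):                     # coarse to fine
        candidates = [
            (ncc(*shifted_overlap(E, I, cy + dy, cx + dx)), cy + dy, cx + dx)
            for dy in range(-radius, radius + 1)
            for dx in range(-radius, radius + 1)
        ]
        result = max(candidates)                       # optimum within the window
        cy, cx = 2 * result[1], 2 * result[2]          # window center at next level
    return result
```

To rank a database, the returned correlation value would be computed for every example and the example with the largest value taken as the nearest neighbor.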

The results obtained so far are encouraging. Experiments
have been conducted using about 20 example images
either for the whole face or for the same subregion
around the eye. The new images (taken from a motion
sequence), for which the nearest neighbor had to
be found, were similar to one of the examples, but were
not included in the example database. All face images
were previously normalized in pose using the robust algorithm
presented in Section 4.1. For all test images
the hierarchical template matching algorithm picked as
nearest neighbor the image that appeared most similar
to human observers. Robustness against small residual
shifts between the images is achieved by choosing the
center of the shift window according to the optimum at
the previous level. Instead of using different frequency
bands within a Laplacian pyramid we also tried correlation
using gradient-magnitude images within a Gaussian
pyramid of the images. This kind of preprocessing
before correlation has been reported to yield superior
recognition results [13]. There was no difference in the
chosen nearest neighbors. However, differences in the
normalized correlation coefficients were less pronounced
for images with added random noise.

4.3.2 Fit error of the pose model 

Another approach to find the nearest neighbors is more
closely related to the automatic pose estimation algorithm
described in Section 4.1 and Appendix A. The
general idea is to make use of the displacement vector
fields between a new image and the example images.
The most similar example will be the one having the
smallest sum of vector magnitudes taken over the entire
region. To be more specific, what is used is the remaining
displacement vector field after aligning both images
according to the constrained affine transformation that
takes care of different poses. In other words, the information
used to assess similarity between images is the
sum of squared differences between the measured displacement
field and the fitted affine displacement field
at each pyramid level (see also (5)). If the variation between
two images is only due to different poses as defined
by the affine model, then both displacement fields will be
identical and both images will be assessed to be similar.

Two ways to compute this similarity measure have
been considered. The simplest way is to discard the
weights assigned to each displacement vector (expressing
confidence in the data) and to compute the homogeneous
error of the fit. The formulas for doing so are
derived in Appendix A.5. The second way is by computing
the weighted errors of the fit as derived in Appendix
A.6 for the three different model cases. The error can be
expressed in terms of the estimated model parameters.
For the weighted error only two sums over
the squares of the measured displacement components
have to be computed in addition to the terms already
computed to estimate the model parameters. This facilitates
an efficient implementation. The summed errors
still have to be normalized to account for different image
sizes or for the individual weights. The criterion
used to assess image similarity is the mean deviation of
the measured displacement field from the fitted affine
displacement field, i.e., expressed as the variance (see
Appendices A.5 and A.6 for details).
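
The idea of using the residual of the pose fit as a dissimilarity measure can be sketched with a generic least squares fit. This is not the closed-form derivation of Appendices A.5 and A.6; it assumes a linearized constrained affine displacement model u = a x - b y + tx, v = b x + a y + ty, and the function name and argument layout are hypothetical.

```python
import numpy as np

def pose_fit_variance(x, y, u, v, w=None):
    """Fit the constrained affine displacement field and return
    (parameters [a, b, tx, ty], weighted mean squared residual)."""
    x, y, u, v = (np.ravel(np.asarray(a, dtype=float)) for a in (x, y, u, v))
    # w=None corresponds to the homogeneous (unweighted) error.
    w = np.ones_like(x) if w is None else np.ravel(np.asarray(w, dtype=float))
    one, zero = np.ones_like(x), np.zeros_like(x)
    A = np.vstack([np.column_stack([x, -y, one, zero]),    # rows for u
                   np.column_stack([y, x, zero, one])])    # rows for v
    d = np.concatenate([u, v])
    sw = np.sqrt(np.concatenate([w, w]))
    params, *_ = np.linalg.lstsq(A * sw[:, None], d * sw, rcond=None)
    r = d - A @ params
    r2 = r[:x.size] ** 2 + r[x.size:] ** 2                 # squared residual per vector
    return params, float(np.sum(w * r2) / np.sum(w))       # normalized by total weight
```

A displacement field that is explained entirely by a change in pose yields a variance close to zero, while residual facial deformation increases it.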

Some experimental results of this approach will be discussed
now. The last two columns in Table 2 give
the weighted and the homogeneous variance for the image
pair in Figure 5; so does Table 4 for Figure 8. These
data suggest the following generalizations. The variances
are bounded between a theoretical lower limit of 0.0 and
an upper limit of about 1.5 that is due to the gradient
technique used to compute displacement vectors. There
are two exceptions. Firstly, the weighted variances at the
highest pyramid level are usually larger than at the next
lower level. This is because the initial pose estimation
is not accurate enough and the variance is dominated
by errors due to misalignment. Secondly, at the highest
resolution the variances are in general smaller than
at the previous level. This phenomenon may have two
explanations. Either the images are very similar and
the alignment is significantly better at highest resolution,
or the images are rather different and the computation
of the displacement vectors fails because the convergence
range of the gradient algorithm is exceeded. This
suggests using only the significant intermediate pyramid
levels to assess image similarity. Indeed, comparing the
data in Tables 2 and 4 shows that the variances for the
significant levels are always smaller for the more similar
images in Figure 5 than for the images in Figure 8 which
exhibit rather different facial expressions.

Previous results (see Section 4.1.5) indicate that even
better alignment can be achieved at a given pyramid
level if more than one iteration of the pose estimation
algorithm is performed. Tables 3 and 5 give the variances
for multiple iterations. The variances tend to decrease
slightly if more iterations per level are done. Although
only the data for the bottom level are given in the tables,
this result holds also for intermediate levels.

4.4 Reconstruction of face images 

4.4.1 Blending patches of interest together 

We now consider the problem of reconstructing a composite
face image from a patchwork of example subregions.

In computer graphics, texture mapping techniques
have been applied to map an image that is a frontal
view of a face onto a 3-D wire frame model of the surface.
Quite realistic animation of facial details such as
eye movements or speech can be achieved by blending sequences
of eye and mouth subimages into a base image at
appropriate positions prior to mapping (see for instance
[24]). Both base and subimage are static frontal views
of the face.

As observed by Duffy [24], simply pasting a subimage,
e.g., a rectangular region of the mouth, yields fairly
unsatisfactory results due to visible discontinuities at
the edges of the pasted area. These discontinuities are
caused by variations in brightness and color between
base and subimages as well as by minor changes in facial
shape (misalignment) occurring for real subjects. To
remedy these shortcomings Duffy proposed a transition
zone located around the bounding box of the rectangular
subimage. Within this transition zone the values of the
composite image are computed by weighted averaging
between corresponding values of base image and subimage.
A weighting function having linear dependence on
position is applied. In this way, significantly better animation,
i.e., fewer spurious effects, can be obtained.


However, this simple way of blending a subimage into
a base image is not flexible enough for our demands. Let
us mention only its most important shortcomings: it requires
a rectangular region of fixed size and location, the
width and location of the transition zone are very critical
in order to achieve realistic results, and there is no
straightforward generalization to many (possibly overlapping)
subimages because of geometrical constraints.

We will now explain a more general algorithm for
blending, i.e., seamlessly merging, several image regions
to form a composite image. The essential requirement is
to preserve important details of the individual source images
without introducing artifacts by the blending process.
Two factors are relevant for choosing the width of
the transition zone. If the transition zone is narrow as
compared to the image features, then the boundary will
still be noticeable in the composite image, although it
will appear blurred. On the other hand, if the transition
zone is too wide, then features from several source images
may appear superimposed, similar to a multi-exposure in
photography. These conflicting requirements cannot be
fulfilled simultaneously in general, i.e., for images covering
a wide range of spatial frequencies. A suitable transition
width can be found only if the spatial frequency
band of the images is relatively narrow.

To overcome this problem Burt & Adelson [18, 19]
proposed a multiresolution approach for merging images
(see footnote 5). First, each source image is decomposed into a set
of bandpass filtered component images. In the next
step, the component images are merged separately for
each band to form mosaic images by weighted averaging
within a transition zone. The transition width corresponds
approximately to a half wave length of the band
center frequency. Finally, these bandpass mosaic images
are simply summed to obtain the desired composite image.
Thus, the transition zone always matches the size of
the image features. This technique has been formulated
for pairs of static source images and demonstrated to
yield superior results over simpler techniques in several
applications [18, 19].

We adopt this idea and give a more general formulation
that applies to any finite number of source images
and to time sequences of images. The subregions of face
images will be called patches of interest (POI). As opposed
to rectangular regions of interest (ROI) a POI may
have arbitrary shape. Moreover, several POIs may overlap
and can be arranged in a stack. In this pseudo 3-D
structure only the top patches contribute to the composite
image.

Our blending algorithm takes two kinds of inputs.
Firstly, an indexed set of subimages compatible with the
corresponding POIs. These subregions of facial example
images are previously normalized in pose. These
examples comprise, among others, base images of the
face under various off-plane rotation views and a variety
of subimages of facial details like different mouth shapes
and different states of eye movement and eye blinks. Secondly,
a sequence of index images that describe the composite
face image appearance over time; an index image
and the composite image thus have
the same size. Each pixel value in these index images
refers to the POI that should dominate the composite
image at the corresponding position.

5 Amnon Shashua provided valuable contributions to our
discussion and the relevant papers.

The procedure consists of the following steps: 

1. For each example subimage associated with the
POI having the index number i, generate a Laplacian
pyramid LE_i consisting of bandpass filtered
images LE_i^l using the procedure described in Section
4.1.1 (see footnote 6). This has to be done only initially and
the pyramids can be stored for fast access. The
storage requirement is only 4/3 of the original image.

2. For each index image $\mathcal{I}_n$ in the time sequence, collect
the set $\mathcal{N}$ of index numbers of all referenced
example subimages. Simultaneously, build a binary
mask image M_i for each index number. A pixel is
assigned the value 1 at positions having the corresponding
index value in the index image and 0
everywhere else.

3. Generate a Gaussian pyramid GM_i for each mask
image M_i included in the index set $\mathcal{N}$.

4. The entries in the GM_i pyramid are used as weights
for the corresponding bandpass filtered example
subimages in LE_i^l. For each band (level l of the
pyramid) a mosaic image LC^l is computed in the
following way:

$$LC^l = \sum_{i \in \mathcal{N}} GM_i^l \cdot LE_i^l$$

where the sum is taken only over the examples included
in the index set $\mathcal{N}$. This significantly increases
efficiency if a large number of potential example
subimages is used as required for realistic
animation.

5. Finally, the procedure of generating the Laplacian
pyramids is reversed to obtain the composite image
C. This is achieved by the following iterative procedure
starting at the highest level N of the mosaic
pyramid LC:

$$GC^{l-1} = LC^{l-1} + \mathrm{expand}(GC^{l}),$$
where $GC^N = LC^N$ and $C = GC^0$.
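
The five steps above can be sketched compactly with NumPy. This is a minimal illustration under stated assumptions, not the implementation referred to in the text: a separable 5-tap binomial kernel with reflecting borders stands in for the pyramid filter of Section 4.1.1, pyramids are recomputed rather than cached, all example images are taken to have the size of the index image, and the function names are hypothetical.

```python
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0   # assumed generating kernel

def blur(img):
    """Separable 5-tap low-pass filter with reflecting borders."""
    h, w = img.shape
    p = np.pad(img, 2, mode="reflect")
    p = sum(KERNEL[k] * p[k:k + h, :] for k in range(5))      # vertical pass
    return sum(KERNEL[k] * p[:, k:k + w] for k in range(5))   # horizontal pass

def reduce_level(img):
    return blur(img)[::2, ::2]

def expand(img, shape):
    up = np.zeros(shape)
    up[::2, ::2] = img                 # zero insertion, then interpolation
    return 4.0 * blur(up)

def gaussian_pyramid(img, levels):
    pyr = [np.asarray(img, dtype=float)]
    for _ in range(levels):
        pyr.append(reduce_level(pyr[-1]))
    return pyr

def laplacian_pyramid(img, levels):
    g = gaussian_pyramid(img, levels)
    return [g[l] - expand(g[l + 1], g[l].shape) for l in range(levels)] + [g[-1]]

def blend(examples, index_image, levels=3):
    """Composite image from example images and one index image."""
    mosaic = None
    for i in np.unique(index_image):                        # step 2: index set
        LE = laplacian_pyramid(examples[int(i)], levels)    # step 1
        GM = gaussian_pyramid(index_image == i, levels)     # steps 2 and 3
        terms = [gm * le for gm, le in zip(GM, LE)]         # step 4, one term
        mosaic = terms if mosaic is None else [m + t for m, t in zip(mosaic, terms)]
    C = mosaic[-1]                                          # step 5: GC^N = LC^N
    for l in range(levels - 1, -1, -1):
        C = mosaic[l] + expand(C, mosaic[l].shape)
    return C
```

If the index image refers to a single example, all mask weights equal one and the composite reproduces that example exactly, which is a convenient sanity check for an implementation.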

Since the POIs associated with each index image may
change between frames, it is desirable to have a smooth
transition between successive frames in an animation.
An adequate way to accomplish this is to apply a low-pass
filter to the binary mask image M_i before generating
the Gaussian pyramid GM_i. Weighting the past few
mask images with an exponentially decaying weighting
function can be implemented very efficiently in a recursive
way (see [60] for algorithm). However, application
of such a filter may require an additional normalization
step of the pixel values in the mosaic images. This is
because in general it cannot be guaranteed that the sum
of all weights for each image location is equal to unity.
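
A first-order recursive filter is one simple way to realize the exponentially decaying weighting of past masks. The sketch below is a generic version of that idea and not necessarily the algorithm of [60]; the smoothing constant `alpha`, the function name, and the explicit per-pixel normalization are assumptions.

```python
import numpy as np

def smooth_masks(index_images, n_indices, alpha=0.5):
    """Yield, per frame, a stack of mask weights of shape (n_indices, H, W)."""
    state = None
    for idx in index_images:
        masks = np.stack([(idx == i).astype(float) for i in range(n_indices)])
        # Recursive update: the weight of a mask from k frames ago decays
        # as (1 - alpha) ** k, so no history of past masks has to be stored.
        state = masks if state is None else alpha * masks + (1.0 - alpha) * state
        total = state.sum(axis=0, keepdims=True)
        # Normalization so that the weights at each pixel sum to unity.
        yield state / np.maximum(total, 1e-12)
```

In this particular sketch the normalization leaves the weights unchanged as long as every pixel refers to a valid index, since each update is a convex combination; it becomes necessary once some masks are dropped from the sum or the weights are otherwise truncated.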

The algorithm sketched above has been implemented
and a very promising, realistic animation of face images
has been obtained. This approach to blending images
could also be successfully combined with texture mapping
techniques.

6 Indication of the individual pyramid level l is omitted if
procedures are applied homogeneously to all levels.

4.4.2 Recovery of original pose 

The last processing step before displaying the reconstructed
face image is to transform the image reassembled
from normalized examples to the pose of the original
input image. This requires reversal of the transformation
of the pose compensation performed on the sender side
from the transmitted pose parameters. Inverting the
mapping of the image warping (that is, computing the
mapping from image 2 to image 1 if the mapping from 1
to 2 is given) is not trivial in general [64]. However, due
to the parametric model applied here, the parameters for
warping the normalized pose face image to the original
pose can be computed easily from the transmitted pose
parameters describing the alignment with the reference
image.

In Appendix B a closed-form solution is derived to 
obtain the inverse mapping parameters from the original 
parameters. To demonstrate the inverse mapping, in 
Figure 13 the reference image depicted in Figure 5 is 
warped towards the pose of the right image; this is the 
reverse of the pose compensation. 
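
For a transformation of the form x' = s R(θ) x + t, the inverse parameters follow directly by solving for x. The sketch below shows this generic closed form; it assumes the parameterization (tx, ty, θ, s), which may differ from the one used in Appendix B, and the function names are illustrative.

```python
import math

def apply_pose(x, y, tx, ty, theta, s):
    """Forward mapping x' = s * R(theta) x + t."""
    c, si = math.cos(theta), math.sin(theta)
    return s * (c * x - si * y) + tx, s * (si * x + c * y) + ty

def invert_pose(tx, ty, theta, s):
    """Parameters of the inverse mapping x = (1/s) R(-theta) (x' - t)."""
    inv_s = 1.0 / s
    c, si = math.cos(-theta), math.sin(-theta)
    inv_tx = -inv_s * (c * tx - si * ty)
    inv_ty = -inv_s * (si * tx + c * ty)
    return inv_tx, inv_ty, -theta, inv_s
```

Applying the forward parameters and then the inverted ones returns every point to its original position, so the receiver only needs the transmitted pose parameters to restore the original pose.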

Figure 14 summarizes the processing steps of a simplistic
video-conference system:

• normalizing the pose of a new face image,

• finding the most similar example out of a database
of normalized images,

• reconstructing the face image given an index number
(in the database) and inversion of the pose normalization.

4.4.3 Interpolation between examples 

In this section we point out possible extensions of the
system architecture outlined in Section 4. Instead of
using the nearest neighbors only, it is natural to “interpolate”
novel views between examples from the database
as already mentioned in Section 3.1. Extending previous
results [45, 47, 48, 46], recent work of Beymer, Shashua
& Poggio [10] presents the mathematical formulation and
experimental demonstrations of several versions of such
an approach. The feasibility of interpolation between
images has been successfully demonstrated for the multidimensional
interpolation of novel human face images.

So far, the examples are selected manually for training.
For applications in video-conferencing the process of
picking adequate examples from the database for subsequent
interpolation obviously has to be automated; this
becomes an even more significant issue for higher dimensional
interpolation. Various strategies are conceivable
and we will suggest some that are still subject to experimental
evaluation. For 1-D interpolation (morphing between a
pair of similar images) an exhaustive search for the two
database images that allow for the best interpolation result,
i.e., the result that comes closest to the novel image,
appears reasonable. But, for higher dimensional interpolation
(at least four examples are needed for 2-D interpolation)
this strategy seems to be prohibitive due to
excessive computational costs (combinatorial explosion).

In order to reduce the search space for the interpolation
basis, i.e., the examples that span the space of possible
interpolated views, we suggest the following strategy:
i) Find the nearest neighbor to the novel view in the
database (cf. Section 4.3). ii) Restrict the search space
to images that are within a certain “distance” from the
novel view or the nearest neighbor; the latter is more efficient
since the distances between all examples can be
precomputed. Of course this presents the problem of
finding a suitable metric to define the distance. Potential
candidates are provided by the algorithms described
in Sections 4.3.1 and 4.3.2. However, other (e.g., feature
based) metrics used for recognition should also be
considered. iii) The maximum distance could be chosen
to contain only the number of examples required for
interpolation of a given dimensionality. Note that this
approach may result in ambiguities if more than one example
image has the same distance; this cannot be
excluded in a high-dimensional space. Alternatively, one
can choose the maximal distance so that more than the
required examples are included in the search space. Subsequently,
the number is reduced by abandoning redundant
images, for instance using a leave-one-out strategy
that keeps only the examples providing the best result.
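
The selection strategy above can be sketched in plain Python. The precomputed distance table, the `slack` parameter for admitting extra candidates, and the callback `error_of`, which is assumed to return the interpolation error obtained with a given basis, are all hypothetical names introduced for this illustration.

```python
def interpolation_candidates(distances, nearest, k, slack=0):
    """Indices of the k + slack examples closest to the nearest neighbor.

    distances[i][j] is a precomputed distance between examples i and j;
    the nearest neighbor itself (distance 0) is part of the result.
    """
    row = distances[nearest]
    ranked = sorted(range(len(row)), key=lambda j: row[j])
    return ranked[:k + slack]

def prune_leave_one_out(candidates, k, error_of):
    """Drop redundant examples one at a time until k remain."""
    current = list(candidates)
    while len(current) > k:
        # Remove the example whose absence leaves the smallest error.
        drop = min(current, key=lambda j: error_of([c for c in current if c != j]))
        current.remove(drop)
    return current
```

Ties in distance are resolved here simply by database order; as noted in the text, such ambiguities cannot be excluded and a real system would need an explicit policy for them.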

The argument for the strategy of taking the nearest
neighbor as an interpolation basis is not obvious and requires
a better practical understanding of the interpolation
algorithms (see [10]). The algorithm consists of two
major steps. First, correspondence vector fields are estimated,
which capture the geometrical relations between
the novel image and the examples in the best possible
way. In the second step, the texture information is pixelwise
interpolated between the intensity values in the
examples referenced by the correspondence vector fields.
In theory, optimal vector field interpolation for the first
step in general cannot be obtained using just the nearest
neighbors. On the other hand, because of illumination
effects, occlusions and distortions it is likely that the
nearest neighbors contain the most similar texture.

The interpolation technique has been applied to images
of whole faces. A natural extension is to apply
the interpolation technique separately to patches of interest
(POI), as proposed in Section 3.5. By means of
the generalized blending technique described in Section
4.4.1 new views can be composed. We expect a large
potential in combining these two concepts. Here are two
examples: i) The location of the iris and pupil of the eye
(as it changes with the direction of gaze) may be interpolated
from four examples or even from two examples if
we disregard the minor vertical movements. Additional
example sets may be used to account for varying pupil
size. ii) Realistic synthesis of eye blinks may require not
many more than two examples.

We will now suggest a further extension of the multidimensional
interpolation algorithm. So far, the same coefficients
are used for interpolating an approximated geometric
relation (correspondence vector fields) between
the examples and the novel image as well as for the pixelwise
interpolation of the intensity (texture) information
(see Sections 4.1 and 4.2 in [10]). These coefficients are
estimated to yield the “best” possible approximation,
e.g., in the least square sense, for the correspondence
vector field of the novel image with respect to the example(s).
While this coupling between the coefficients
for geometric and texture interpolation makes sense for
certain applications of 1-D interpolation, e.g., for frame
rate conversion (see [8]), it is not necessarily the best
approach for more general cases.

We suggest exploiting the freedom of adjusting the
coefficients for geometric and texture interpolation independently.
A straightforward way to do this is to estimate
the optimal coefficients for geometric interpolation
first (as before) and subsequently to optimize a second
set of coefficients for texture interpolation. The second
step uses the previously recovered geometric relations to
access the intensity information at corresponding locations
in the example images. The coefficients for the
second step could be found by least square minimization
applied to the intensity values, for example. While only
doubling the number of parameters, we expect even more
realistic “rendering” of novel images, especially if the example
images are captured under small variations in the
illumination (direction and intensity). A further amendment
could include “virtual” example images that could
partially compensate for small lighting changes. In the
simplest case, the coefficients for a completely black and
a white image could be used to adjust the average brightness
of the synthesized image. More elaborate versions
would use several additional “virtual” examples showing
slowly varying intensity in different directions. Adjusting
the corresponding coefficients will to some extent
simulate changes in illumination direction. Of course,
this is not correct in the strict physical sense that would
require multiplication of the surface reflection by the illumination
intensity, where both may be functions of the
relative angles. However, for small variations the linear
compensation will give reasonable results at a very low
computational cost.
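
The second, texture-only estimation step could be sketched as an ordinary least squares problem. The sketch assumes that the example images have already been warped into correspondence with the novel image by the geometric step; the function name is hypothetical. In an unconstrained fit a completely black image is the zero vector and contributes nothing, so only a constant ("white") virtual example is added here.

```python
import numpy as np

def texture_coefficients(warped_examples, novel, use_virtual=True):
    """Least squares texture coefficients and the resulting synthesized image."""
    novel = np.asarray(novel, dtype=float)
    columns = [np.ravel(np.asarray(e, dtype=float)) for e in warped_examples]
    if use_virtual:
        columns.append(np.ones(novel.size))     # virtual example: uniform white image
    A = np.column_stack(columns)
    coeffs, *_ = np.linalg.lstsq(A, np.ravel(novel), rcond=None)
    return coeffs, (A @ coeffs).reshape(novel.shape)
```

Further virtual examples with slowly varying intensity (for instance horizontal and vertical ramps) could be appended as additional columns in the same way.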

Two recent achievements should be mentioned in the
context of generating new views from a small number
of model views or example images. Shashua & Toelg
[53] showed that a nominal quadric transformation for
all image points, i.e., a transformation assuming that an
object surface can be approximated by a quadric, can
be successfully applied to register two face images. All
parameters of the transformation (the quadric and the
relative camera geometry) can be recovered from only
nine corresponding points over two views [53, 54]. Alternatively,
the transformation can be recovered using
only four corresponding points and a given conic in one
view (encompassing the face in our application) [54].

This algorithm is relevant here for two purposes.
Firstly, it can be used as a preprocessing step to facilitate
pixelwise correspondence, i.e., bringing two views
into closer alignment. This step is essential for views
that are too different to be directly accessible to standard
dense correspondence algorithms; a small number
of distinct feature points can easily be found in the two
initial views. Secondly, the transformation according to
the quadric surface model is described by a few parameters
(at most 17 are needed). In a video-conference
system only these parameters need to be transmitted in
order to synthesize novel views of a face from one given
example, provided that the views are not too different
(self-occlusion, etc.). The nominal quadric transformation
is significantly more general than simpler transformations
commonly used (e.g., affine, or transformation
due to a plane) and superior registration results have
been obtained with face images (see [53, 54] for examples).

The second achievement is a generalization of the “linear
combination of views” result obtained by Ullman
& Basri [62] that relates three orthographic views of a
3-D object (ignoring self-occlusion). Recently, Shashua
[52, 51] proved that the image coordinates of corresponding
points over any three perspective views (uncalibrated
pinhole camera) of a 3-D object are related by a pair of
trilinear equations. The 17 independent coefficients of
this trilinear form can be recovered linearly from 9 corresponding
points over all three views. Once these coefficients
are recovered and full correspondence between the
two model views has been established, the corresponding
locations in the novel (third) view can be obtained
uniquely for all other points.

The direct approach of using the trilinear result to
generate new views has several theoretical advantages
over classical structure from motion methods as well as
over methods to recover non-metric structure (see [51]
for a detailed discussion). Moreover, processes that are
known to be unstable in the presence of noise, such as
recovering the epipolar geometry, are avoided. The trilinear
algorithm proved to be significantly more stable
in the presence of errors in the image measurements.
So far, the trilinear algorithm has been evaluated only
in computer simulations and using re-projection of isolated
points in real imagery, though an implementation
to transform dense images is planned for the near future
(see footnote 7).

Although in [52, 51] emphasis is given to the task of
recognition of 3-D objects, the trilinear method may
have interesting applications in an example-based video-conference
system. Only 17 parameters are needed to
represent a new view with respect to two model views
(as opposed to only one model view for the nominal
quadric transformation). The two model views, or examples
as we called them earlier, are available on the sender
and the receiver side. The same algorithm for achieving
full correspondence between these reference images is applied
on both sides. To encode a third view, the sender
solves for the 17 parameters by using many points to increase
robustness; this can be done using a least squares
approach. Only the 17 parameters needed for reconstruction
are transmitted, however.

5 Conclusions and outlook 

The concept of an alternative approach to video- 
conferencing that is sketched in the first part of this 
paper appears to be very promising. Several algorithms 
have been presented that form modules in a system ar¬ 
chitecture. Each of these modules has proved to be 
robust under realistic conditions. Much further work
on integration and refinement of the system is required.
Once a more elaborate system based on our approach
is available, it will be interesting to compare its performance
with state-of-the-art systems utilizing traditional
model-based approaches.

7 Amnon Shashua, personal communication, March 1994.

For an automatic video-conference system, i.e., a sys¬ 
tem that does not require any human intervention, ad¬ 
ditional components are obviously required. However, 
many suitable algorithms are already known and de¬ 
scribed in the literature. The most important compo¬ 
nents are briefly discussed in the sequel. 

The separation of the face and the uncovered back¬ 
ground can be achieved by the methods sketched at the 
end of Section 4.2. Recently an algorithm for human face 
segmentation by fitting an ellipse to the head has been 
described [56]. This algorithm is robust enough to deal 
with images having moderately cluttered backgrounds. 

Another, more critical problem is the automatic selec¬ 
tion and positioning of the POIs in the face image. This 
task is significantly simplified by the robust pose nor¬ 
malization presented in this paper. In the normalized 
face images knowledge about the average facial geome¬ 
try is easily applicable to define relevant regions. For 
individual faces, regions of high surface structure can be 
detected by means of texture analysis techniques (e.g., 
high gradient or high spectral energy). The generalized 
blending algorithm described in Section 4.4.1 to some 
extent smooths out visible discontinuities at edges between
adjacent POIs. It is desirable, however, to locate
boundaries within regions of low surface texture and not
at conspicuous facial features; loosely speaking, we
apply a reversed edge detector. For this purpose
orientation-selective filters (like Gabor filters, wavelets, or
steerable filters [27]) may be the way to go. They make
it possible to search for appropriate locations depending
on the orientations of boundary lines that are approximately
given.

An important step is the automatic acquisition of the 
example database. Here at least two distinct tasks have 
to be distinguished. During the initialization phase ex¬ 
amples have to be acquired that span the largest possible 
range of facial expressions and poses. However, extreme 
poses and expressions may be disregarded at this stage. 
Moreover, two cases have to be considered. In the stan¬ 
dard case no prior examples for a person are available 
when a person uses the system for the first time. Then 
all examples have to be transmitted initially using con¬ 
ventional image compression, e.g., JPEG. In the second 
case prior examples are available from previous sessions. 
New examples have to be acquired on the sender side 
and it has to be determined which of the old examples 
are still compatible with the current situation. This is 
necessary to update changes in facial hair, for example. 
The gain is that usually only some new examples may 
have to be transmitted to the receiver side. However, 
the computational cost of the evaluation on the sender 
side may be higher. 

During the subsequent transmission phase of the ses¬ 
sion the task is somewhat different. In general, novel 
images should be approximated in terms of the examples 
with sufficient accuracy. Precautions should be taken to 
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detect whenever the best possible reconstruction (based 
on the available examples) is not satisfactory. In these 
rare cases, e.g., for unusually strong expressions, (com¬ 
pressed) new image data have to be transmitted. At this 
point it is not clear whether this additional image data 
for non-standard cases should supplement the standard 
database as an additional example; this has to be sub¬ 
jected to experimental evaluation. 
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A Hierarchical estimation of global pose from local displacements 


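We assume an affine model for the displacement vector field. The affine displacement field $\mathbf{d}(\mathbf{x}_i)$ is determined at
any image location $\mathbf{x}_i = (x_i, y_i)^T$ by six model parameters:

$$\mathbf{d}(\mathbf{x}_i) = \mathbf{A}\mathbf{x}_i + \mathbf{t} \qquad (11)$$

with

$$\mathbf{A} = \begin{pmatrix} b_x & c_x \\ b_y & c_y \end{pmatrix} \quad \text{and} \quad \mathbf{t} = \begin{pmatrix} a_x \\ a_y \end{pmatrix}. \qquad (12)$$

Suppose we have $n$ image points $\mathbf{x}_i \in \Omega$ $(i = 1, \ldots, n)$ with a measured displacement $\hat{\mathbf{d}}(\mathbf{x}_i) = (\hat{d}_x(x_i, y_i), \hat{d}_y(x_i, y_i))^T$.
We obtain an overdetermined linear system with $2n$ equations and six unknowns that can be written in matrix
form:

$$\mathbf{M}\mathbf{p} = \hat{\mathbf{d}}$$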
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In general this system cannot be solved exactly. Instead, we want to find the parameter vector p that minimizes 
the error between the measured displacement field $\hat{\mathbf{d}}(\mathbf{x}_i)$ and the fitted affine displacement field $\mathbf{d}(\mathbf{x}_i)$. We assume
Gaussian statistics of the process and use the $L_2$-norm as a distance measure. So, we want to find

$$\min_{\mathbf{p}} e = \sum_i w_i^2 \left\| \hat{\mathbf{d}}(\mathbf{x}_i) - \mathbf{d}(\mathbf{x}_i) \right\|^2 \qquad (15)$$

where we allow for a weight $w_i$ for each data point $\mathbf{x}_i$. These weights may account for the confidence that we
associate with the measurement $\hat{\mathbf{d}}(\mathbf{x}_i)$.
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w n ) 


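the solution of the minimization problem (15) formally is

$$\mathbf{p} = \mathbf{M}^{*}\hat{\mathbf{d}}, \qquad (17)$$

where $\mathbf{M}^{*} = (\mathbf{M}^T \mathbf{W}^2 \mathbf{M})^{-1} \mathbf{M}^T \mathbf{W}^2$ is the weighted pseudo-inverse. This can be written as

$$\underbrace{\mathbf{M}^T \mathbf{W}^2 \mathbf{M}}_{\mathbf{B}}\, \mathbf{p} = \underbrace{\mathbf{M}^T \mathbf{W}^2 \hat{\mathbf{d}}}_{\mathbf{b}} \quad \Rightarrow \quad \mathbf{B}\mathbf{p} = \mathbf{b}. \qquad (18)$$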


Now, $\mathbf{b} \in \mathbb{R}^6$ and $\mathbf{B} \in \mathbb{R}^{6 \times 6}$ is a square matrix. The system can be solved for the parameter vector $\mathbf{p} = \mathbf{B}^{-1}\mathbf{b}$
provided that $\det(\mathbf{B}) \neq 0$ and therefore the inverse $\mathbf{B}^{-1}$ exists. For reasons of numerical accuracy and stability one
would generally prefer to solve the overdetermined system by means of the computationally more costly singular
value decomposition (SVD) (cf. [49]). However, our relatively simple model is well-behaved and it turns out that in
the implemented case the matrix inversion is trivial.
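The weighted normal equations can be illustrated with a short pure-Python sketch. It is not the authors' implementation: the function names `solve_linear` and `fit_affine`, the parameter ordering p = (a_x, b_x, c_x, a_y, b_y, c_y), and the layout of M (one row for the x component and one for the y component of each data point) are assumptions made here for illustration. The sketch accumulates B = MᵀW²M and b = MᵀW²d̂ directly and solves Bp = b by Gaussian elimination.

```python
def solve_linear(mat, rhs):
    """Solve mat * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [[float(v) for v in mat[r]] + [float(rhs[r])] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("normal matrix is singular (det(B) = 0)")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - known) / aug[r][r]
    return x


def fit_affine(points, disps, weights):
    """Weighted least-squares fit of d(x) = A x + t.

    points:  list of (x, y) image locations
    disps:   list of measured displacements (d_x, d_y)
    weights: list of confidences w_i (they enter squared, as in W^2)
    Returns p = (a_x, b_x, c_x, a_y, b_y, c_y); this ordering is an assumption.
    """
    B = [[0.0] * 6 for _ in range(6)]
    b = [0.0] * 6
    for (x, y), (dx, dy), w in zip(points, disps, weights):
        w2 = w * w
        # Each data point contributes two rows of M: one per component.
        rows = (([1.0, x, y, 0.0, 0.0, 0.0], dx),
                ([0.0, 0.0, 0.0, 1.0, x, y], dy))
        for m, d in rows:
            for i in range(6):
                b[i] += w2 * m[i] * d
                for j in range(6):
                    B[i][j] += w2 * m[i] * m[j]
    return solve_linear(B, b)
```

With this layout B is block-diagonal with two identical 3x3 blocks, so the x and y parameters could also be solved separately.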
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A.l General case 

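In the general case of an affine displacement field with six free parameters we have $\mathbf{p}_6 = \mathbf{B}_6^{-1}\mathbf{b}_6$ with

$$\mathbf{B}_6 = \begin{pmatrix}
\sum w_i^2 & \sum w_i^2 x_i & \sum w_i^2 y_i & 0 & 0 & 0 \\
\sum w_i^2 x_i & \sum w_i^2 x_i^2 & \sum w_i^2 x_i y_i & 0 & 0 & 0 \\
\sum w_i^2 y_i & \sum w_i^2 x_i y_i & \sum w_i^2 y_i^2 & 0 & 0 & 0 \\
0 & 0 & 0 & \sum w_i^2 & \sum w_i^2 x_i & \sum w_i^2 y_i \\
0 & 0 & 0 & \sum w_i^2 x_i & \sum w_i^2 x_i^2 & \sum w_i^2 x_i y_i \\
0 & 0 & 0 & \sum w_i^2 y_i & \sum w_i^2 x_i y_i & \sum w_i^2 y_i^2
\end{pmatrix} \qquad (19)$$

and

$$\mathbf{p}_6 = \begin{pmatrix} a_x \\ b_x \\ c_x \\ a_y \\ b_y \\ c_y \end{pmatrix}, \quad
\mathbf{b}_6 = \begin{pmatrix}
\sum w_i^2 \hat{d}_x(x_i, y_i) \\ \sum w_i^2 \hat{d}_x(x_i, y_i)\, x_i \\ \sum w_i^2 \hat{d}_x(x_i, y_i)\, y_i \\
\sum w_i^2 \hat{d}_y(x_i, y_i) \\ \sum w_i^2 \hat{d}_y(x_i, y_i)\, x_i \\ \sum w_i^2 \hat{d}_y(x_i, y_i)\, y_i
\end{pmatrix}. \qquad (20)$$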
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A.2 No shear 

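If we admit only translation, scale and rotation and do not allow for any component of shear, $\mathbf{A}$ in (12) takes the
form

$$\mathbf{A} = s\mathbf{R} - \mathbf{I} \qquad (21)$$

with

$$\mathbf{R} = \begin{pmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{pmatrix} \quad \text{and} \quad \mathbf{I} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \qquad (22)$$

Consequently (11) becomes

$$\mathbf{d}(\mathbf{x}_i) = (s\mathbf{R} - \mathbf{I})\mathbf{x}_i + \mathbf{t}. \qquad (23)$$

Therefore, we have the constraints for the parameters:

$$b_x = c_y = s \cdot \cos\alpha - 1 \quad \text{and} \quad -b_y = c_x = s \cdot \sin\alpha. \qquad (24)$$

The scale $s$ and the angle of rotation $\alpha$ in the image plane are derived from $\mathbf{A}$ as

$$s = \sqrt{\det(\mathbf{A} + \mathbf{I})} \quad \text{and} \quad \alpha = \arccos\left(\frac{b_x + 1}{s}\right). \qquad (25)$$

Of course, $s = 1$ and $\alpha = 0$ yields a constant displacement field.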
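A minimal sketch of how scale and rotation might be read off the fitted matrix entries is given below. The function name `scale_and_rotation` is hypothetical. Note one deliberate substitution: instead of the arccos in (25), the sketch uses `math.atan2(c_x, b_x + 1)`, which agrees with (25) for angles between 0 and π under the constraints (24) but also preserves the sign of the rotation.

```python
import math


def scale_and_rotation(b_x, c_x, b_y, c_y):
    """Recover scale s and in-plane rotation alpha from A = s*R - I.

    Assumes the no-shear constraints b_x = c_y and -b_y = c_x hold
    (at least approximately) for the supplied coefficients.
    """
    det = (b_x + 1.0) * (c_y + 1.0) - c_x * b_y   # det(A + I) = s^2
    if det <= 0.0:
        raise ValueError("det(A + I) must be positive for a valid scale")
    s = math.sqrt(det)
    # With b_x + 1 = s*cos(alpha) and c_x = s*sin(alpha):
    alpha = math.atan2(c_x, b_x + 1.0)
    return s, alpha
```

For example, building the coefficients from s = 0.9102 and α = 0.1359 rad and passing them to `scale_and_rotation` returns the same two values up to rounding.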

With the constraints in (24), the linear system (18) can be simplified by adding the 2nd and 6th row and subtracting
the 5th from the 3rd row of (19) and (20). This gives $\mathbf{p}_4 = \mathbf{B}_4^{-1}\mathbf{b}_4$ with
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A.3 Pure translation 

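If we admit only pure translation ($c_y = b_x = b_y = c_x = 0$) the displacement field is of course constant, $\mathbf{d}(\mathbf{x}_i) = \mathbf{t}$. In
this case we have $\mathbf{B}_2\mathbf{p}_2 = \mathbf{b}_2$, where

$$\mathbf{B}_2 = \begin{pmatrix} \sum w_i^2 & 0 \\ 0 & \sum w_i^2 \end{pmatrix}, \quad \mathbf{p}_2 = \begin{pmatrix} a_x \\ a_y \end{pmatrix}, \quad \mathbf{b}_2 = \begin{pmatrix} \sum w_i^2 \hat{d}_x(x_i, y_i) \\ \sum w_i^2 \hat{d}_y(x_i, y_i) \end{pmatrix} \qquad (28)$$

are derived from the 1st and 4th row of the general case equation system (19) and (20). Since $\mathbf{B}_2 = \sum w_i^2 \cdot \mathbf{I}$ we can
solve $\mathbf{B}_2\mathbf{p}_2 = \mathbf{b}_2$ directly:

$$\mathbf{p}_2 = \frac{1}{\sum w_i^2}\, \mathbf{b}_2, \qquad (29)$$

which is equivalent to

$$a_x = \frac{\sum w_i^2 \hat{d}_x(x_i, y_i)}{\sum w_i^2} \quad \text{and} \quad a_y = \frac{\sum w_i^2 \hat{d}_y(x_i, y_i)}{\sum w_i^2}. \qquad (30)$$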
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A.4 New reference frame 


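As we will see, several expressions can be simplified significantly if we express the affine displacement field in a new
frame of reference. In order to do so, we introduce a new function

$$\mathbf{d}'(\mathbf{x}'_i) = \mathbf{A}'\mathbf{x}'_i + \mathbf{t}' = \mathbf{A}\mathbf{x}_i + \mathbf{t} = \mathbf{d}(\mathbf{x}_i) \qquad (31)$$

expressing the affine displacement field in terms of $\mathbf{A}'$ and $\mathbf{t}'$. We perform a coordinate transform, so that the
origin of the new reference frame coincides with the center of gravity of the image:

$$\mathbf{x}'_i = \mathbf{x}_i - \bar{\mathbf{x}} \quad \text{with} \quad \bar{\mathbf{x}} = \begin{pmatrix} \bar{x} \\ \bar{y} \end{pmatrix} = \begin{pmatrix} \sum w_i^2 x_i \,/\, \sum w_i^2 \\ \sum w_i^2 y_i \,/\, \sum w_i^2 \end{pmatrix}. \qquad (32)$$

In the new reference frame the equation systems take a much simpler form, since

$$\sum w_i^2 x'_i = \sum w_i^2 x_i - \bar{x} \sum w_i^2 = 0, \qquad (33)$$

and likewise for $y'_i$. Using (31) and (15) we now want to minimize the expression

$$e = \sum_i w_i^2 \left\| \hat{\mathbf{d}}(\mathbf{x}_i) - \mathbf{d}'(\mathbf{x}'_i) \right\|^2 \qquad (34)$$

with respect to the new parameters $a'_x, b'_x, c'_x$ and $a'_y, b'_y, c'_y$.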
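The benefit of the centered frame can be sketched in code. The closed-form expressions below are derived here from the normal equations under the condition Σw²x' = Σw²y' = 0; they are an illustration under that assumption rather than a transcription of the paper's formulas, and the function name `fit_affine_centered` is hypothetical. The translation decouples into a weighted mean, and only a 2x2 system remains for the matrix coefficients. The last step maps the result back to the original frame using t = t' - A'x̄, which follows from substituting x' = x - x̄.

```python
def fit_affine_centered(points, disps, weights):
    """Weighted affine fit via the centered reference frame.

    Returns ((a_x, b_x, c_x), (a_y, b_y, c_y)) in the ORIGINAL frame.
    """
    w2 = [w * w for w in weights]
    sw = sum(w2)
    xbar = sum(w * p[0] for w, p in zip(w2, points)) / sw
    ybar = sum(w * p[1] for w, p in zip(w2, points)) / sw

    sxx = sxy = syy = 0.0
    sd = [0.0, 0.0]    # sum w^2 d_k
    sdx = [0.0, 0.0]   # sum w^2 d_k x'
    sdy = [0.0, 0.0]   # sum w^2 d_k y'
    for w, (x, y), d in zip(w2, points, disps):
        xc, yc = x - xbar, y - ybar
        sxx += w * xc * xc
        sxy += w * xc * yc
        syy += w * yc * yc
        for k in (0, 1):
            sd[k] += w * d[k]
            sdx[k] += w * d[k] * xc
            sdy[k] += w * d[k] * yc

    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        raise ValueError("data points are degenerate (collinear)")

    result = []
    for k in (0, 1):
        a_c = sd[k] / sw                            # a' is a weighted mean
        b_c = (syy * sdx[k] - sxy * sdy[k]) / det   # 2x2 inverse, first row
        c_c = (sxx * sdy[k] - sxy * sdx[k]) / det   # 2x2 inverse, second row
        a = a_c - b_c * xbar - c_c * ybar           # back to original frame
        result.append((a, b_c, c_c))
    return result[0], result[1]
```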

A.4.1 General case 

From (31) and (32) we can directly derive the old parameters from the new ones:

$$\mathbf{A}'\mathbf{x}'_i + \mathbf{t}' = \mathbf{A}'(\mathbf{x}_i - \bar{\mathbf{x}}) + \mathbf{t}' = \mathbf{A}'\mathbf{x}_i + (\mathbf{t}' - \mathbf{A}'\bar{\mathbf{x}}) = \mathbf{A}\mathbf{x}_i + \mathbf{t}. \qquad (35)$$

Comparison of the coefficients on both sides yields (and similarly for $a_y, b_y, c_y$):

$$a_x = a'_x - b'_x \bar{x} - c'_x \bar{y}, \quad b_x = b'_x \quad \text{and} \quad c_x = c'_x. \qquad (36)$$

Because of (33), in the new reference frame (19) and (20) simplify to

and by inverting the remaining 2×2 matrix we get the solutions for $b'_x, c'_x$ in closed form:

The solutions for $a'_y, b'_y, c'_y$ are found similarly.
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A.4.2 No shear 

In the new reference frame (26) and (27) simplify to

and

$\mathbf{B}'_4$ is now diagonal and $\mathbf{B}'_4\mathbf{p}'_4 = \mathbf{b}'_4$ can be directly solved for the parameters:
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A.5 Homogeneous error of the fit 

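In order to assess how well the estimated affine motion $\mathbf{d}'(\mathbf{x}'_i)$ describes the measured displacement field $\hat{\mathbf{d}}(\mathbf{x}_i)$ let us
now derive the homogeneous error of the fit. Setting all weights $w_i = 1$ we obtain from (34):

$$\mathit{he} = \sum_i \left\| \hat{\mathbf{d}}(\mathbf{x}_i) - \mathbf{d}'(\mathbf{x}'_i) \right\|^2 = \sum_i \left( \hat{\mathbf{d}}(\mathbf{x}_i) - (\mathbf{A}'\mathbf{x}'_i + \mathbf{t}') \right)^2. \qquad (47)$$

Note that because of the $L_2$-norm, $\mathit{he}$ can be decomposed into the errors of the x and y component:

$$\mathit{he} = \mathit{he}_x + \mathit{he}_y. \qquad (48)$$

The mean deviation of the measured displacements from the fitted affine displacement field is estimated by the
homogeneous variance

$$\mathrm{var}_h = \frac{\mathit{he}}{n - 1}. \qquad (49)$$

In the general case we obtain for the error of the x component:

$$\begin{aligned}
\mathit{he}_x = {} & \sum \hat{d}_x^2(x_i, y_i) \\
& - 2a'_x \sum \hat{d}_x(x_i, y_i) - 2b'_x \sum \hat{d}_x(x_i, y_i)\, x'_i - 2c'_x \sum \hat{d}_x(x_i, y_i)\, y'_i \\
& + 2a'_x b'_x \sum x'_i + 2a'_x c'_x \sum y'_i + 2b'_x c'_x \sum x'_i y'_i \\
& + a'^2_x\, n + b'^2_x \sum x'^2_i + c'^2_x \sum y'^2_i,
\end{aligned} \qquad (50)$$

and similarly for $\mathit{he}_y$, where $n$ is the number of data points.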

A.6 Weighted error of the fit 

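The mean deviation of the measured displacements $\hat{\mathbf{d}}(\mathbf{x}_i)$ from the fitted affine displacement field $\mathbf{d}'(\mathbf{x}'_i)$ is estimated
by the weighted variance approximated by

$$\mathrm{var}_w \approx \frac{\mathit{we}}{\sum w_i^2}. \qquad (51)$$

From (34) we get the weighted error of the fit:

$$\mathit{we} = \sum_i w_i^2 \left\| \hat{\mathbf{d}}(\mathbf{x}_i) - \mathbf{d}'(\mathbf{x}'_i) \right\|^2 = \sum_i w_i^2 \left( \hat{\mathbf{d}}(\mathbf{x}_i) - (\mathbf{A}'\mathbf{x}'_i + \mathbf{t}') \right)^2. \qquad (52)$$

Note that $\mathit{we}$ can be decomposed again into the errors of the x and y components:

$$\mathit{we} = \mathit{we}_x + \mathit{we}_y. \qquad (53)$$

A.6.1 General case

In the general case we get for the error of the x component:

$$\begin{aligned}
\mathit{we}_x = {} & \sum w_i^2 \hat{d}_x^2(x_i, y_i) \\
& - 2a'_x \sum w_i^2 \hat{d}_x(x_i, y_i) - 2b'_x \sum w_i^2 \hat{d}_x(x_i, y_i)\, x'_i - 2c'_x \sum w_i^2 \hat{d}_x(x_i, y_i)\, y'_i \\
& + 2a'_x b'_x \sum w_i^2 x'_i + 2a'_x c'_x \sum w_i^2 y'_i + 2b'_x c'_x \sum w_i^2 x'_i y'_i \\
& + a'^2_x \sum w_i^2 + b'^2_x \sum w_i^2 x'^2_i + c'^2_x \sum w_i^2 y'^2_i,
\end{aligned} \qquad (54)$$

and similarly for $\mathit{we}_y$. Since $\sum w_i^2 x'_i = \sum w_i^2 y'_i = 0$ and with the solution for $a'_x$ several terms cancel out and we get

$$\begin{aligned}
\mathit{we}_x = {} & \sum w_i^2 \hat{d}_x^2(x_i, y_i) + 2b'_x c'_x \sum w_i^2 x'_i y'_i \\
& - 2b'_x \sum w_i^2 \hat{d}_x(x_i, y_i)\, x'_i - 2c'_x \sum w_i^2 \hat{d}_x(x_i, y_i)\, y'_i \\
& - a'^2_x \sum w_i^2 + b'^2_x \sum w_i^2 x'^2_i + c'^2_x \sum w_i^2 y'^2_i.
\end{aligned} \qquad (55)$$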

A.6.2 No shear 

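With $c_y = b_x$ and $b_y = -c_x$ we get from (53) and (55)

$$\begin{aligned}
\mathit{we} = {} & \sum w_i^2 \hat{d}_x^2(x_i, y_i) + \sum w_i^2 \hat{d}_y^2(x_i, y_i) \\
& - 2b'_x \left[ \sum w_i^2 \hat{d}_x(x_i, y_i)\, x'_i + \sum w_i^2 \hat{d}_y(x_i, y_i)\, y'_i \right] \\
& - 2c'_x \left[ \sum w_i^2 \hat{d}_x(x_i, y_i)\, y'_i - \sum w_i^2 \hat{d}_y(x_i, y_i)\, x'_i \right] \\
& - (a'^2_x + a'^2_y) \sum w_i^2 + (b'^2_x + c'^2_x) \left( \sum w_i^2 x'^2_i + \sum w_i^2 y'^2_i \right).
\end{aligned} \qquad (56)$$

This can be further simplified using solutions for $b'_x, c'_x$:

$$\begin{aligned}
\mathit{we} = {} & \sum w_i^2 \hat{d}_x^2(x_i, y_i) + \sum w_i^2 \hat{d}_y^2(x_i, y_i) \\
& - (a'^2_x + a'^2_y) \sum w_i^2 - (b'^2_x + c'^2_x) \left( \sum w_i^2 x'^2_i + \sum w_i^2 y'^2_i \right).
\end{aligned} \qquad (57)$$

A.6.3 Pure translation

In the case of pure translation ($c_y = b_x = b_y = c_x = 0$) we see from (36) that $a_x = a'_x$ and $a_y = a'_y$. The error of the
fit is then given by

$$\mathit{we} = \sum w_i^2 \hat{d}_x^2(x_i, y_i) + \sum w_i^2 \hat{d}_y^2(x_i, y_i) - (a_x^2 + a_y^2) \sum w_i^2. \qquad (58)$$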
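The no-shear error expression can be checked numerically with a small sketch. Everything below is illustrative: the function names are hypothetical, and the closed-form solutions for a'_x, a'_y, b'_x, c'_x are derived here by setting the derivatives of the weighted error to zero in the centered frame (with b'_y = -c'_x and c'_y = b'_x), not copied from the paper. The sketch compares the directly summed weighted residuals with the compact expression (57) and returns the weighted variance, read here as we divided by Σw².

```python
def fit_no_shear_centered(points, disps, weights):
    """No-shear (translation, scale, rotation) fit in the centered frame.

    Returns (a_x', a_y', b_x', c_x') and the centered coordinates.
    """
    w2 = [w * w for w in weights]
    sw = sum(w2)
    xbar = sum(w * p[0] for w, p in zip(w2, points)) / sw
    ybar = sum(w * p[1] for w, p in zip(w2, points)) / sw
    centered = [(x - xbar, y - ybar) for x, y in points]
    spread = sum(w * (xc * xc + yc * yc) for w, (xc, yc) in zip(w2, centered))
    a_x = sum(w * d[0] for w, d in zip(w2, disps)) / sw
    a_y = sum(w * d[1] for w, d in zip(w2, disps)) / sw
    b = sum(w * (d[0] * xc + d[1] * yc)
            for w, (xc, yc), d in zip(w2, centered, disps)) / spread
    c = sum(w * (d[0] * yc - d[1] * xc)
            for w, (xc, yc), d in zip(w2, centered, disps)) / spread
    return (a_x, a_y, b, c), centered


def weighted_error_direct(params, centered, disps, weights):
    """Sum of squared weighted residuals, evaluated term by term."""
    a_x, a_y, b, c = params
    total = 0.0
    for w, (xc, yc), (dx, dy) in zip(weights, centered, disps):
        rx = dx - (a_x + b * xc + c * yc)
        ry = dy - (a_y - c * xc + b * yc)
        total += w * w * (rx * rx + ry * ry)
    return total


def weighted_error_closed_form(params, centered, disps, weights):
    """Compact expression (57), valid at the least-squares solution."""
    a_x, a_y, b, c = params
    w2 = [w * w for w in weights]
    sum_d2 = sum(w * (d[0] ** 2 + d[1] ** 2) for w, d in zip(w2, disps))
    spread = sum(w * (xc * xc + yc * yc) for w, (xc, yc) in zip(w2, centered))
    return sum_d2 - (a_x ** 2 + a_y ** 2) * sum(w2) - (b * b + c * c) * spread


def weighted_variance(we, weights):
    """var_w as we / sum(w_i^2); the squared weights are an assumption."""
    return we / sum(w * w for w in weights)
```

The compact form avoids a second pass over the residuals, since all sums it needs are already accumulated during the fit.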


A.7 Propagation of affine parameters from coarse to fine levels 

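In the sequel we derive how to propagate the affine motion parameters from a coarse pyramid level to the next finer
level. The affine displacement field

$$\mathbf{d}^l(\mathbf{x}^l_i) = \mathbf{A}^l \mathbf{x}^l_i + \mathbf{t}^l \qquad (59)$$

at the level $l$ of the pyramid is determined at any image location $\mathbf{x}^l_i = (x^l_i, y^l_i)^T$ by six parameters:

$$\mathbf{A}^l = \begin{pmatrix} b^l_x & c^l_x \\ b^l_y & c^l_y \end{pmatrix} \quad \text{and} \quad \mathbf{t}^l = \begin{pmatrix} a^l_x \\ a^l_y \end{pmatrix}. \qquad (60)$$

For the previous coarser level $l + 1$ with lower resolution we have accordingly

$$\mathbf{d}^{l+1}(\mathbf{x}^{l+1}_i) = \mathbf{A}^{l+1} \mathbf{x}^{l+1}_i + \mathbf{t}^{l+1}. \qquad (61)$$

Now, since the sampling grid of level $l$ has twice the density of that at the coarser level $l + 1$, the coordinates of
corresponding points and their displacements are related by

$$\mathbf{x}^l_i = 2 \cdot \mathbf{x}^{l+1}_i \quad \text{and} \quad \mathbf{d}^l(\mathbf{x}^l_i) = 2 \cdot \mathbf{d}^{l+1}(\mathbf{x}^{l+1}_i). \qquad (62)$$

Inserting this into (59) leads to

$$\mathbf{d}^{l+1}(\mathbf{x}^{l+1}_i) = \mathbf{A}^l \mathbf{x}^{l+1}_i + \tfrac{1}{2}\mathbf{t}^l. \qquad (63)$$

Comparing this with (61) yields

$$\mathbf{A}^l = \mathbf{A}^{l+1} \quad \text{and} \quad \mathbf{t}^l = 2 \cdot \mathbf{t}^{l+1}. \qquad (64)$$

Therefore, to propagate the affine parameters from level $l + 1$ to the next finer level $l$ we only have to double the
translation vector $\mathbf{t}^{l+1}$. The coefficients of matrix $\mathbf{A}^{l+1}$ remain unchanged.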
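The propagation rule (64) amounts to very little code. The sketch below uses hypothetical names (`propagate_to_finer_level`, `displacement`) and represents A as a 2x2 nested list and t as a two-element list; the helper `displacement` is included only to check relation (62) on an example.

```python
def propagate_to_finer_level(A, t):
    """Map affine parameters from pyramid level l+1 to the finer level l."""
    A_fine = [row[:] for row in A]        # matrix coefficients are unchanged
    t_fine = [2.0 * t[0], 2.0 * t[1]]     # translation vector is doubled
    return A_fine, t_fine


def displacement(A, t, x):
    """Evaluate the affine displacement d(x) = A x + t at one location."""
    return [A[0][0] * x[0] + A[0][1] * x[1] + t[0],
            A[1][0] * x[0] + A[1][1] * x[1] + t[1]]
```

Applying `propagate_to_finer_level` repeatedly, once per level, carries parameters estimated at a coarse final level down to the original resolution at level 0.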
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A.8 Combining the affine parameters at one level 

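The combination of the initial affine parameters on level $l$ (propagated from level $l + 1$) with the parameters of
the residual affine motion on level $l$ (estimated from the residual optical flow at level $l$) is now considered. The refined
displacement field $\mathbf{d}^l_r$ at level $l$ is given by

$$\mathbf{d}^l_r = \mathbf{d}^l + \bar{\mathbf{d}}^l, \qquad (65)$$

where $\bar{\mathbf{d}}^l(\mathbf{x}^l_i) = \bar{\mathbf{A}}^l \mathbf{x}^l_i + \bar{\mathbf{t}}^l$ is the residual affine displacement field estimated at level $l$. With (59) we get because of
the linearity of the affine transformation:

$$\mathbf{d}^l_r(\mathbf{x}^l_i) = (\mathbf{A}^l \mathbf{x}^l_i + \mathbf{t}^l) + (\bar{\mathbf{A}}^l \mathbf{x}^l_i + \bar{\mathbf{t}}^l) = (\mathbf{A}^l + \bar{\mathbf{A}}^l)\mathbf{x}^l_i + (\mathbf{t}^l + \bar{\mathbf{t}}^l) = \mathbf{A}^l_r \mathbf{x}^l_i + \mathbf{t}^l_r. \qquad (66)$$

Therefore, the refined affine parameters $(\mathbf{A}^l_r, \mathbf{t}^l_r)$ at level $l$ are given as the sum of the propagated parameters $(\mathbf{A}^l, \mathbf{t}^l)$
and the residual parameters $(\bar{\mathbf{A}}^l, \bar{\mathbf{t}}^l)$ estimated at level $l$.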
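How the combination step (66) and the propagation step (64) might fit together in a coarse-to-fine loop is sketched below. This is an assumed control flow for illustration only: `combine` and `coarse_to_fine` are hypothetical names, and `estimate_residual` stands for whatever procedure estimates the residual affine motion at a given level (it is supplied by the caller and not specified here).

```python
def combine(A, t, A_res, t_res):
    """Refined parameters = propagated parameters + residual parameters."""
    A_ref = [[A[r][c] + A_res[r][c] for c in range(2)] for r in range(2)]
    t_ref = [t[0] + t_res[0], t[1] + t_res[1]]
    return A_ref, t_ref


def coarse_to_fine(top_level, final_level, estimate_residual):
    """Refine affine parameters from top_level down to final_level.

    estimate_residual(level, A, t) is a caller-supplied (hypothetical)
    function returning the residual parameters (A_res, t_res) at `level`,
    given the parameters propagated so far.
    Returns (A, t) expressed at final_level.
    """
    A = [[0.0, 0.0], [0.0, 0.0]]
    t = [0.0, 0.0]
    for level in range(top_level, final_level - 1, -1):
        A_res, t_res = estimate_residual(level, A, t)
        A, t = combine(A, t, A_res, t_res)
        if level > final_level:
            # Propagate to the next finer level: only t is doubled.
            t = [2.0 * t[0], 2.0 * t[1]]
    return A, t
```

If the estimation is terminated at a level above 0, the returned translation still has to be doubled once per remaining level to express it at the original resolution.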


B Inversion of affine pose transformation 

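The transformation defined in (11) and (12) gives us the displacements from the example image $E$ to the new image
$I$. This allows us to warp the face in $I$ towards the normalized pose of the example image $E$. The position of a point
$\mathbf{x}'_i = (x'_i, y'_i)^T$ in $I$ is derived from the affine displacement field $\mathbf{d}(\mathbf{x}_i)$ that maps $\mathbf{x}_i \rightarrow \mathbf{x}'_i$ by

$$\mathbf{x}'_i = \mathbf{d}(\mathbf{x}_i) + \mathbf{x}_i = (\mathbf{A} + \mathbf{I})\mathbf{x}_i + \mathbf{t}. \qquad (67)$$

Here, the prime indicates that we use image $I$ as a reference frame.

On the receiver side we are faced with the problem of reversing this pose normalization. We have a face image $E$
with normalized pose (indexed in the database, or interpolated between several examples) and we want to generate
an image $I$ according to the transmitted pose parameters of our model. We are now looking for the displacement
field $\mathbf{d}'(\mathbf{x}'_i)$ that maps $\mathbf{x}'_i \rightarrow \mathbf{x}_i$. For our affine transformation model there is a closed form solution. Provided that
$\det(\mathbf{A} + \mathbf{I}) \neq 0$ the inverse matrix exists, and we obtain from (67) the location $\mathbf{x}_i$ in the normalized image $E$ by

$$\mathbf{x}_i = (\mathbf{A} + \mathbf{I})^{-1}\mathbf{x}'_i - (\mathbf{A} + \mathbf{I})^{-1}\mathbf{t}. \qquad (68)$$

With the definition

$$\mathbf{d}'(\mathbf{x}'_i) = \mathbf{A}'\mathbf{x}'_i + \mathbf{t}' = \mathbf{x}_i - \mathbf{x}'_i \qquad (69)$$

we conclude from (68) that

$$\mathbf{A}' = (\mathbf{A} + \mathbf{I})^{-1} - \mathbf{I} \quad \text{and} \quad \mathbf{t}' = -(\mathbf{A} + \mathbf{I})^{-1}\mathbf{t}. \qquad (70)$$

Therefore, expressed in terms of the model parameters, $\mathbf{A}'$ is given by

$$\mathbf{A}' = \frac{1}{\det(\mathbf{A} + \mathbf{I})} \begin{pmatrix} c_y + 1 & -c_x \\ -b_y & b_x + 1 \end{pmatrix} - \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \qquad (71)$$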
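The inversion in (70) and (71) can be sketched as follows. The function names `invert_affine_pose` and `apply_displacement` are hypothetical; A is a 2x2 nested list holding (b_x, c_x; b_y, c_y) and t holds (a_x, a_y). The second helper is only there to demonstrate the round trip between the two images.

```python
def invert_affine_pose(A, t):
    """Return (A', t') such that x = x' + A' x' + t' undoes x' = x + A x + t."""
    m11, m12 = A[0][0] + 1.0, A[0][1]      # entries of (A + I)
    m21, m22 = A[1][0], A[1][1] + 1.0
    det = m11 * m22 - m12 * m21
    if det == 0.0:
        raise ValueError("det(A + I) = 0: the pose transform is not invertible")
    inv = [[m22 / det, -m12 / det],
           [-m21 / det, m11 / det]]        # (A + I)^-1
    A_inv = [[inv[0][0] - 1.0, inv[0][1]],
             [inv[1][0], inv[1][1] - 1.0]]
    t_inv = [-(inv[0][0] * t[0] + inv[0][1] * t[1]),
             -(inv[1][0] * t[0] + inv[1][1] * t[1])]
    return A_inv, t_inv


def apply_displacement(A, t, x):
    """Move a point by the affine displacement field: x + (A x + t)."""
    return [x[0] + A[0][0] * x[0] + A[0][1] * x[1] + t[0],
            x[1] + A[1][0] * x[0] + A[1][1] * x[1] + t[1]]
```

Only the six transmitted pose parameters are needed on the receiver side to compute the inverse mapping.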
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Figure 3: Depicted arf several facial expressions that may occur during a video-conferencing session. All images 
exhibit a roughly frontal view of the face. Conventional approaches to normalize face images utilize labeled feature
points, like the pupils of the eyes or the corners of the mouth. The first image shows a neutral facial expression with gaze
in the forward direction. This is the standard case that most face recognition systems are designed to cope with. The
second row demonstrates the movement of both eyes due to changes in direction of gaze (conjugate eye-movements);
vergence movements (disjunctive eye-movements) alter the distance between the pupils. The positions of these points
(centers of the pupils) may differ by more than 1 cm on either side of the forward direction. This is a large fraction
of the inter-ocular distance of about 7 cm. The last row depicts the movement of the corners of the mouth due to
skin deformations caused by facial expressions and normal speech. As is evident, estimating the pose based on the
correspondence of these points is rather unreliable if facial expressions are admissible. Nevertheless, such feature
points are commonly used for normalizing face images with moderate deviations from neutral expressions. Finally,
the pupils may entirely disappear when the eyelid is closed during winking or blinking. A pose estimation method
relying on the correct detection of these feature points would be led astray.
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Figure 4: Band-pass filtered images of a face (in front of a dark background). A Difference-of-Gaussians (DOG) 
filter approximating a Laplacian-of-Gaussian operator of width $\sigma$ (“Mexican-hat filter”) is applied to the original
image. The images are arranged in descending order of $\sigma$ starting with $\sigma$ = 64.0 pixels for the upper left image and
ending with $\sigma$ = 2.0 pixels for the lower right image; $\sigma$ decreases by one octave between consecutive images. Facial
expressions and details of the face (eyes, mouth, etc.) are most conspicuous in the high-frequency images, whereas
the overall position, size, and orientation of the head appear dominant in the low-frequency bands.
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Figure 5: Two face images with similar expressions, but different pose and size, used for demonstrating the robust 
pose estimation and compensation algorithm. The left image is the reference image, e.g., stored in the example
database, the right one is a new frame of the incoming data stream that has to be normalized in pose. All images
used in the sequel are 255 x 320 pixels in size and digitized with 8-bit quantization. The focal length of the camera
was approximately 16 mm and the camera distance was about 1.2 m. The face on the right side is inclined by 8-9°
and is about 10% smaller than that in the reference image. These values are obtained by direct reading from the
images.



Figure 6: Here the right image of Figure 5 is transformed to resemble the pose of the reference image. Pose
parameters have been computed automatically by the algorithm described in Section 4.1. The six images show how
the results depend on the resolution level where the estimation is terminated. Subsequently, the pose parameters are
extrapolated to the original resolution and the images are warped accordingly. The lowest resolution is level 5 (upper
left); the original image resolution is at level 0 (lower right). By visual inspection it is obvious that level 3 (upper
right) already achieves good alignment of the face with the reference image. The resulting image becomes stationary
and estimation at higher resolutions does not lead to significant improvement. For more quantitative details see
Figure 7 and Table 2.
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Figure 7: a x : O a y : + s : □ a : x 

Different graphic representations of the estimated pose parameters given in Table 2. Investigated is the dependence
of the accuracy on the pyramid level where the estimation is terminated. The diagrams show: a) pose parameters
at each final level normalized by their bottom values (level 0); b) error to highest resolution estimate normalized
by the bottom value; c) relative change between successive levels normalized by the value of the lower current level;
d) relative change between successive levels normalized by the bottom value. The notation is the same as is used
throughout the text: $a_x$ and $a_y$ are the horizontal and vertical translation, respectively; $s$ is the scale factor; $\alpha$ is
the angle of in-plane rotation. The more careful analysis confirms the results already discussed in Figure 6. The
estimates converge to the values at the highest resolution. At level 2 the estimated parameters are already very close
to the values obtained at the highest resolution (level 0). The largest changes in the parameters happen already at
the high pyramid levels with low resolutions (see lower row).


Table 2: Pose parameters estimated by the algorithm. The values in each row correspond to the images depicted
in Figure 6, where $l$ is the final level for estimation before the parameters are propagated to the original resolution
at level 0. The notation for the parameters is the same as throughout the text: $a_x$ and $a_y$ are the horizontal and
vertical translation, respectively; $s$ is the scale factor; $\alpha$ is the angle of in-plane rotation in radians. The last two
columns give the weighted variance ($\mathrm{var}_w$) and the homogeneous variance ($\mathrm{var}_h$) as discussed in Section 4.3.
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Table 3: Pose parameters at the bottom level (1 = 0) depending on the number of iterations i per level estimated for 
the image pair in Figure 5. The estimates vary very little for increasing number of iterations. The largest changes
(for $a_y$ and $\alpha$ between i = 1 and 2) are on the order of 5%. $\mathrm{var}_w$ and $\mathrm{var}_h$ are the weighted and the homogeneous
variance, respectively.

   i      a_x      a_y       s        α      var_w    var_h
   1    -41.20    43.18   0.9102   0.1359   0.9217   0.7124
   2    -41.29    45.41   0.9076   0.1465   0.8998   0.6925
   3    -41.34    46.14   0.9067   0.1505   0.8969   0.6860
   5    -41.40    47.00   0.9057   0.1557   0.8959   0.6778
  10    -41.46    47.40   0.9055   0.1580   0.8984   0.6778



Figure 8: Two other face images. This example is somewhat more “difficult” as compared to Figure 5, since the facial 
expressions in both images are quite different. Again, the left image represents the reference pose. An inclination of 
6-7° and an increase in size of about 7-8% can be measured directly by comparing both images.



Figure 9: See caption of Figure 6. The results in this figure are for the two images in Figure 8. By visual inspection 
no significant change occurs for final estimation levels higher than two (lower left). 
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Figure 10: a x :0 a y : + s: □ a: x 

Different graphic representations of the estimated pose parameters given in Table 4. See also caption of Figure 7. 
Again, the parameters converge to the values at highest resolution (upper row) and the largest changes take place 
at the high pyramid levels representing low frequencies (lower row). However, this tendency is not as pronounced as 
in Figure 7. 


Table 4: Pose parameters for the resulting images in Figure 9. See also caption of Table 2. 


   l      a_x      a_y       s        α      var_w    var_h
   5    -36.56    -1.95   0.9972   0.0580   0.4995   0.5100
   4    -66.64    -3.03   1.0549   0.0083   0.3232   0.7876
   3    -77.34     1.74   1.0683   0.0516   0.6429   1.0447
   2    -80.90     4.48   1.0695   0.0682   0.9457   1.0312
   1    -81.62     3.96   1.0718   0.0670   1.4389   1.0743
   0    -81.66     3.93   1.0721   0.0669   1.0295   0.7762
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Figure 11: a x : O a y : + s : □ a : x 

More detailed analysis of the parameters depending on the number of iterations (1, 2, 3, 5, respectively) per level 
before the estimation is continued at the next higher resolution level. The values are normalized by the bottom 
value. Table 5 gives the corresponding absolute parameter values. Diagram a) is identical to Figure 10 a). The 
major improvement is achieved by performing two iterations instead of one pass since the convergence to the bottom 
values becomes smoother. 


Table 5: Estimated pose parameters at the bottom level (l = 0) as functions of the number of iterations i per level.
The values correspond to the left data points in Figure 11. Note the significant change in $a_y$ and $\alpha$ between i = 1
and 2, whereas the variation for larger number of iterations is fairly small.

   i      a_x      a_y       s        α      var_w    var_h
   1    -81.66     3.93   1.0721   0.0669   1.0295   0.7762
   2    -87.49    12.88   1.0728   0.1135   1.0090   0.7641
   3    -88.70    14.17   1.0737   0.1204   1.0163   0.7668
   5    -89.00    13.82   1.0740   0.1191   1.0126   0.7652
  10    -87.62    11.33   1.0716   0.1056   0.9987   0.7618















Figure 12: Illustration of the results depending on the number of iterations per level: left i = 1; middle i = 3; right 
i = 10. Careful visual inspection is required to perceive the differences although the values in Table 5 differ. 



Figure 13: The image transformation can be inverted due to the parametric model used here. Given the pose 
parameters used to align the new image with the reference image, the inverse parameters can be computed as 
described in the text. Note that this is not possible for a general mapping. For illustration, the reference image in 
Figure 8 is transformed to the pose of the new image, i.e., the mapping is done in the reverse direction. 



Figure 14: Sequence of images to exemplify the processing steps of a simple video-conference system. The left 
image is a newly acquired video frame. The middle image is the most similar normalized example image found 
in the database. The pose parameters of the new image with respect to the reference example are estimated and 
transmitted together with the index number for the example. On the receiver side the stored normalized example is 
transformed towards the pose of the new image on the sender side (right image). 
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Figure 15: Several images to evaluate and demonstrate the robustness of the algorithm and the range of poses it 
can deal with. The horizontal and vertical translation in the first two rows amount to about 110 and 50 pixels, 
respectively. In the third row an in-plane rotation of 15-20° to each side (measured from the vertical line) can be 
read from the images. Finally, the variation in size in the last row is in the range of 25%. Other examples with
different backgrounds and a variety of distinct poses and facial expressions can be found in [58]. 
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Figure 16: Pose compensated versions of the original images depicted in Figure 15 are in corresponding positions. 
The image in the first row is the reference image defining the intended pose. The rather distinct facial expression as
compared to the new images is noteworthy. These results are obtained with only one iteration (i = 1) per level and
the estimation is terminated at level two (l = 2). The reference image resembles the images in the last row of Figure
3. Conventional algorithms relying on localized feature points would very likely produce unreliable results. 
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Figure 17: Results similar to Figure 16, but for a reference image with one eye closed and with severe distortions 
around the mouth. 
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